ABSTRACT

This dissertation examines the relationships between
a document being indexed and the index terms assigned to
that document in an attempt to quantify the extent of
"machine-like" indexing occurring when librarians and
scientists index technical text. A number of possible
relationships between the text and the index assignments
are predicated and tested with two models: a multiple
linear regression model and a Boolean combinatorial
model. It is concluded that indexers in general do
not index technical text in a "machine-like" fashion
and that neither model is useful as a general predictor
of human indexing.
ABSTRACT

A Study and Model of Human 
Indexing Behavior 

Caryl McAllister 

This dissertation examines the relationships between a
document being indexed and the index terms assigned to that
document in an attempt to quantify the extent of
"machine-like" indexing occurring when librarians and
scientists index technical text.

A number of possible relationships between the text and
the index assignments are predicated and tested with two
models: a multiple linear regression model and a Boolean
combinatorial model. The models test two classes of
relationships for the best relationship in that class. Both 
models find and correlate textual evidence in the document 
for a given index term with the descriptors assigned by the 
indexers. In all, some sixty types of textual evidence (or 
clues) are considered. 

For the experiment twelve indexers were divided into two
groups of six each: professional librarians and engineers or
scientists. Each subject indexed all twenty sample
documents. There was a significant difference between the
amount of librarian indexing and the amount of
engineer/scientist indexing accounted for. Although the
difference was not great, the engineers and scientists proved
to be less predictable than the librarians on the basis of
the textual clues.

Over the entire sample of documents and for all indexers, 
the regression model accounts for about 30% of the indexing. 
For a single document, however, as much as 40 to 80% of the 
indexing can be explained by the regression model. The 
location and type of textual clue deemed important by the 
indexers varies considerably from document to document. 
Hence variations in clue "style" among documents lower the
overall percentage because the entire sample is a compromise
position for all the documents.

Regressions run on four single indexers show a very small
correlation between clues and indexing, ranging from 7 to 22%.
Individually the indexers are less predictable than the
group.

The information from the Boolean combinatorial model is 
less comprehensive primarily because not enough computer time 
was available for a full development of the model. Based on 
a one-third sample, the model correctly predicted about 65%
of all indexing decisions. No other combinatorial runs were
made. 

It is concluded that indexers in general do not index
technical text in a "machine-like" fashion and that neither
model is useful as a general predictor of human indexing.

1. Introduction

The information explosion is a widely recognized
phenomenon. Increasing numbers of people engaged in research
have produced increasing numbers of papers reporting that
research. Libraries, engaged in the business of making that
research available on demand, must process increasing numbers
of such documents. This processing remains a major library
bottleneck.

In addition to the investment in clerical labor and
paperwork to acquire a document, a library often must also
spend professional labor indexing it. This indexing makes it 
possible for patrons to find the particular items they want 
in a large collection without having to read the entire 
collection. The document and the index entries for a 
document are stored in some convenient place so that someone 
wishing to use the library or information center may search 
the indexes to locate it. 

Two tools have been developed to aid indexers: indexing
rules and lists of approved index headings. While both rules
and headings are commonly available to aid in author
indexing, subject indexing is quite another story. Here,
lists of approved headings (also called thesauri) are 
plentiful, but there are only vague and imprecise notions of 
how an indexer should go about choosing the most appropriate 
headings out of the approved list for the document at hand. 

Even though a large part of a document retrieval system's
resources are devoted to this task, the question of how
people do subject indexing has been the subject of much
conjecture and only a little experimentation. There have
long been arguments in the literature about the educational
requirements for indexers. If indexers do little more than
copy words from the document, we shouldn't be paying
graduate-level subject experts to do the job. On the other
hand, if indexers are involved in some rather sophisticated
decision-making, we shouldn't be talking so glibly about
substituting machines for people.



Only in biomedicine has anyone attempted even a partial 
answer to the question of how people go about indexing. Yet 
none of the biomedical studies has been conclusive enough to 
answer the question even for that particular field. And no
one has tried the experiment for less idiosyncratic
literature than medicine.

For some time, researchers interested in automatic
indexing have been proposing that machines should choose
index terms on the basis of machine-recognizable textual
clues present in the text. Such clues as noun phrases, word
frequency or location, word stems and synonyms have been
suggested. If textual clues account for a large part of a
human indexer's behavior, then it might be feasible to
automate indexing. And if this behavior can be modelled, the
model could form the basis for just such an automatic
indexing system. If, on the other hand,
mechanically-recognizable clues do not account for a large
part of a human indexer's behavior, automatic indexers would
have to go beyond simple textual clues to do human-like
indexing.



Because of the strong interest in machine-recognizable 
textual clues for automatic indexing, because of the numerous 
suggestions that human indexers do little more than word 
matching, and because a very large proportion of any 
reference retrieval system's budget is invested in indexing, 
this thesis attempts to answer the question: To what extent
do machine-recognizable textual clues account for human
indexer behavior?



To highlight the influence of training on indexing, we
use indexers of two kinds: librarian-indexers, who by
training and experience ought to know how to go about
indexing, and scientist-indexers, who by training and
experience ought to be most familiar with the subject matter
to be indexed. Differences in indexing behavior between the 
two groups are of interest. We are also interested in the 
textual clues themselves and attempt to isolate those clues 
which contribute most to the explanation of human indexing. 
To do this effectively, a large number of clues and selection 
rules are covered systematically. 



Chapter 2 reviews previous studies of human indexing and
the indexing rules that have been suggested for automatic
indexing. Besides surveying commonly quoted human rules,
this chapter points out that rules used by humans are not, in
fact, rules but general behavioral guidelines. The
discussion of previous models of human indexing behavior
points out the strengths and weaknesses of these studies.
The analysis of rules used for automatic indexers shows the
variety of rules discussed in the literature and suggests the
types of textual clues which should be accounted for in the
indexing models. The textual clues and the assignment rules
used in the two models are discussed in this chapter.

The two models developed in the thesis are presented in
Chapters 3 and 4. The first is a multiple linear regression
model chosen for its statistical and predictive properties.
The second is a combinatorial model which is used to test
many of the clues summarized in the second chapter. Each
model has advantages and disadvantages. Taken together, they
complement each other. Both models quantify the extent to
which machine-recognizable textual clues account for indexer
behavior. Either can act in a predictive manner. Chapter 3
presents the regression model, statistical tests for 
regression and the computer program used for regression. 
Chapter 4 presents the same information for the combinatorial 
model. 

Chapter 5 discusses the experimental procedures and gives 
descriptive information about the experimental samples. The 
computer programs written to obtain and analyze the data and 
to calculate results are presented in some detail in this
section.

The conclusions of the thesis and suggestions for further
research are in Chapter 6.

2. A Review of Indexing Rules for Humans and Machines



2.1 Rules for Indexers 



For the purposes of this discussion we must distinguish 
between a Procedure and a Guideline. A Procedure is a set of 
exact and detailed rules which invariably lead the performer
to the same outcome provided he is given the same input.
Most computer programs are Procedures because, given the same
input they operate on this input in exactly the same way each 
time to produce exactly the same output. The performer of 
the Procedure need not be a machine, however. Suppose I give 
you instructions for getting to my house from San Francisco. 
These instructions might include taking certain roads, 
turning in a specified direction at certain intersections, 
and so forth. If you follow these instructions, then you
will arrive at my house. There is a guarantee that if the 
directions (the procedure) are followed, the result (arriving 
at my house) is assured. Of course, there is no guarantee 
that everyone arriving at my house has followed the same 
directions to get there. 



In contrast, Guidelines have no guaranteed outcome. A
Guideline is a set of warnings or cautions which are not
detailed enough to invariably lead the performer to the same
outcome even when given the same input. For instance, I
might tell you to: head South; if speed is essential, take
the freeway; watch for signs; use a map. These Guidelines
tell you to watch for signs, but don't say which signs. They
suggest that a map might be helpful, but don't tell exactly
how a map is to be used or how it might be helpful.
Guidelines for getting to my house won't guarantee arrival
and they certainly won't guarantee that everyone using them
will get to my house in the same way.

Let us draw the analogy to human indexing. Mention is
made in the literature of "indexing rules". These rules are,
in fact, Guidelines, not Procedures. They do not guarantee
that anyone who follows them will arrive at the same index
set. Proof of this may be found in indexing consensus
studies (Hooper (1965), St. Laurent (1966)) where the same
instructions, thesaurus vocabulary and documents almost
invariably lead to different index sets when used by
different indexers or even the same indexer at different
times. We will review some of these indexing guidelines here
because they are important for understanding how indexers go
about their task.



Based on experience with chemical literature in an
industrial company, Carol Penn (1962) outlines indexing as
the search for answers to four groups of questions. Penn
says the indexer first asks "What information is in this
document, how is it organized, and into how many intellectual
components is it subdivided?" No procedure is given for
deciding what constitutes "information", but Penn suggests
that the indexer read the most condensed document statement
first (the title), and then work toward the most narrative
(the abstract, then paragraph headings, and finally the full
document). This suggestion is, in part, a procedure because
it tells the indexer where and in what order to look. It
does not, however, tell the indexer what to look for or when
to stop looking.



The second question is: "How are the overall document
and each of its component subdivisions related to or
identified with the current and anticipated activities of the
users?" This is an identification of the information from
the point of view of the user as well as the author. The
terminology of the author is put into relationship with the
accepted terminology of the user group. But the terms
"component subdivisions", and "current and anticipated 
activities of the users" are not defined, nor are any 
instructions given for finding out just what these current or 
anticipated activities might be. In order to estimate 
potential usefulness, indexers would have to estimate the 
likelihood that a project might be undertaken. But to expect 
indexers to predict the course of scientific investigation is 
to turn them into managers of scientific projects. This 
second rule, therefore, serves primarily as a warning to 
indexers that the needs of the users are an important factor 
in a reference retrieval system. 



The next question is: "How new, how reusable or how
original is the information in each component?" Penn argues
that if the indexers cannot judge which information the users
will consider new and interesting, then the indexing depth
will be too great or too shallow. This rule requires that
the indexer know the state of knowledge of present and future
system users. This is an obviously impossible condition, yet
it points to a common-sense notion that indexers shouldn't be
indexing the obvious. The difficulty lies in deciding what
is and what is not obvious.



To answer the last question: "How should information be
described?" the indexer rephrases the mental picture of the
document into descriptors from the thesaurus. If, indeed,
the document is understood, then the indexer does have some
idea of what the author is saying: he has a mental picture
of the subject(s) of the document. But this does not assure
that two indexers will have identical mental pictures, nor
does it assure that the interpretation of this mental picture
into index terms will produce identical results.

In general, then, Penn’s rules are cautions to warn the 
indexer that the subject content of a document, the 
activities and the subject expertise of the users, and the 
thesaurus vocabulary of the system are important and should 
be considered when indexing. But these cautions do not
constitute a Procedure.



Other published indexing rules are similar to Penn's.
Bernier (1965, 326) suggests the following: 1) choose to
index those subjects which are novel, emphasized, or
extensively reviewed, 2) index to the maximum specificity
warranted by the author, 3) choose those terms most
frequently used in the field, 4) provide guidance
(cross-references) among headings and from synonyms, 5) check
all index entries for accuracy, 6) use modifying phrases to
make subject terms more specific and to provide better
guidance. Again, these are cautions to the indexer about
subjects that are novel, the maximum level of indexing
specificity, etc. Bernier's rules substantiate the fact that
so-called indexing rules are not procedural. Rees (1962) and
MacMillan and Welt (1961) agree.



We have pointed out the vagueness and imprecision
inherent in the indexing "rules" to be found in the
literature. The business of indexing is no more procedural
when seen from a philosophical point of view. Wilson (1968)
discusses several ways one might determine the subject of a
document. For instance, an indexer might list, sentence by
sentence, what a document was about. The list could
justifiably include the names of the objects mentioned in
each sentence, or the names of the concepts employed by the 
author in expounding on his subject, or the names of the 
things or individuals indirectly referred to, or any 
combination of these. While it is possible to recognize 
obviously wrong entries on this list, knowing what is 
obviously wrong does not resolve the many occasions when 
indexers can differ considerably in acceptable indexing 
assignments. Wilson’s arguments point out, once again, that 
indexers are operating with Guidelines. 



In conclusion then, we have seen that the indexing rules
humans profess to use are not Procedures, but Guidelines.
Indexing rules may give general guidance; they do not
constitute a how-to-do-it course. According to the dictates
of the indexing profession, indexing is an art, not a
science. Consciously, at least, human indexing involves a
great deal of judgement, subject expertise, knowledge of the
users and of the document retrieval system. None of these
things is easily automated by present day standards of
artificial intelligence.

This is not to say that we cannot use a Procedure to 
mimic human indexing. As the next section demonstrates, what 
indexers do and what they say they do may be quite different 
things. 





2.2 Results of Human Indexing

Instead of investigating what indexers say they do, some
experimenters have tried to find out what indexers do by
looking at the index sets produced. Studies of this kind 
cannot claim to have investigated the paths indexers used to 
arrive at a particular index set. However, possible 
hypothetical mechanisms for reaching a particular index set 
can be investigated and the outcome of these artificial rules 
can be compared with the outcome of human indexing. 

Fels and Jacobs (1963) were interested in the extent to
which indexers became "linguistically creative" when
indexing. They defined three sources of indexing terms: 1)
words occurring in the text, 2) synonyms for text words, and
3) paraphrases of the text. These types of index terms are
increasingly "creative". Using random samples taken from
state and federal statutes, straight term selections
constituted 63 to 91% of the index set, synonym substitutions
ranged from 0.5 to 5.8% and paraphrases from 7.4 to 33.7%.
The statistics quoted indicate that legal indexers, at any 
rate, are not particularly creative linguistically. Note 
that although this study indicates where the indexing words 
came from, it does not indicate how the indexers arrived at 
particular index entries. 
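
The three-way classification of index-term sources can be
sketched in Python. This is an illustration only, not the
procedure used in the study cited: the synonym table, the
function names and the exact-word matching are all assumptions
made for the example.

```python
import re

# Hypothetical synonym table: index word -> text words counted as synonyms.
SYNONYMS = {"automobile": {"car", "vehicle"}}


def words(text):
    """Lower-case word tokens of a piece of text."""
    return re.findall(r"[a-z]+", text.lower())


def classify_term(term, text):
    """Return 'text word', 'synonym' or 'paraphrase' for one index term."""
    text_words = set(words(text))
    term_words = words(term)
    # Source 1: every word of the term occurs in the text itself.
    if all(w in text_words for w in term_words):
        return "text word"
    # Source 2: each word occurs in the text or has a synonym there.
    if all(w in text_words or SYNONYMS.get(w, set()) & text_words
           for w in term_words):
        return "synonym"
    # Source 3: anything else is treated as a paraphrase.
    return "paraphrase"


def source_shares(terms, text):
    """Percentage of the index set falling into each source category."""
    counts = {"text word": 0, "synonym": 0, "paraphrase": 0}
    for term in terms:
        counts[classify_term(term, text)] += 1
    return {k: 100.0 * v / len(terms) for k, v in counts.items()}
```

Like the study it illustrates, such a tally says where an index
term could have come from, not how the indexer chose it.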

A study by Montgomery and Swanson (1962) strongly
substantiates Fels and Jacobs. They chose subject headings
at random from Index Medicus. Each of the titles indexed
under each of these headings was compared with the heading
itself. In 86% of the cases the subject heading, or a synonym
for it, appeared in the title.

O'Connor (1964) later disagreed with the Montgomery-Swanson
study. He argued that it ignored subdivisions of
subject headings and used synonyms inconsistently to obtain
the high degree of matching. To substantiate his points, he
tried the Montgomery-Swanson indexing rules on titles from
three medical indexing systems. Based on samples of 50 
titles from each of the systems, the heading-title 
correlation in these samples ranged from 19 to 45% in the 
first system, from 40 to 68% in the second, and from 13 to 
39% in the third. This is in sharp contrast to the 86% 
agreement obtained by Montgomery and Swanson. At least as 
far as medical text is concerned, there is little agreement 
on the profitability of using title words and their synonyms 
as an artificial procedure for imitating human indexing. 



A few studies have been made of indexing in engineering.
Slamecka and Zunde (1963) found 80% of the humanly-assigned
index terms in the abstracts of 30 documents from Scientific
and Technical Aerospace Reports. Bottle (1970) compared the
titles of articles with humanly-assigned subject headings for
each article. Titles were chosen from Applied Science and
Technology, British Technology Index and Engineering Index.
From 48 to 68% of the titles either matched the assigned
heading or contained a syntactic variant or a synonym for it.
Graves and Helander (1970) compared titles and abstracts
taken from Petroleum Abstracts with the humanly-assigned
index terms. Exact and synonym matches accounted for 40% of
the humanly-assigned index terms. Although each of these
studies was in the same general subject area of engineering,
the percentage of human index terms accounted for ranged from
40 to 80%.



The studies discussed up to this point investigated 
possible mechanisms for arriving at the same indexing humans 
produced. All of the studies worked from the already- 
assigned index set backwards to the text. In effect, this 
approach covers only half of the problem. It accounts for 
where the index term came from; it provides a textual 
justification for the assignment of each index term. But it 
does not tell how many matches with other subject headings 
might have occurred. For instance, suppose the title
"Real-time Input Preprocessor for a Pattern Recognition Computer"
were compared with the subject heading "Pattern recognition". 
There is an exact match between the subject heading and a 
portion of the title. This would be counted as one instance 
of an exact thesaurus-title match in the studies discussed 
above. But this same title also matches two other subject 
headings: "Real-time computer systems" and "Input 
preprocessors for computers". These matches were ignored by 
the above studies. Although these studies kept track of the 
index terms or subject headings which were assigned, they did
not try to explain why other terms were not assigned. Both 
explanations are required in a complete model. 
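
The point can be made concrete with a small Python sketch that
tallies every heading in a thesaurus, not only the assigned ones.
The matching rule (at least two shared word stems), the crude
plural stripping, the stopword list and the fourth heading
"Optical character readers" are assumptions introduced for the
example; none of them comes from the studies discussed.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def stems(text):
    """Content-word stems: lower-case, drop stopwords, strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in tokens if t not in STOPWORDS}


def matches(title, heading, min_shared=2):
    """Assumed rule: a heading matches if it shares min_shared stems."""
    return len(stems(title) & stems(heading)) >= min_shared


def tally(title, thesaurus, assigned):
    """Count all four outcomes, not only the explained assignments."""
    result = {"explained": 0, "over-assigned": 0,
              "unexplained": 0, "correctly omitted": 0}
    for heading in thesaurus:
        hit = matches(title, heading)
        if hit and heading in assigned:
            result["explained"] += 1
        elif hit:
            result["over-assigned"] += 1
        elif heading in assigned:
            result["unexplained"] += 1
        else:
            result["correctly omitted"] += 1
    return result


TITLE = "Real-time Input Preprocessor for a Pattern Recognition Computer"
THESAURUS = ["Pattern recognition", "Real-time computer systems",
             "Input preprocessors for computers",
             "Optical character readers"]  # last heading is hypothetical
print(tally(TITLE, THESAURUS, assigned={"Pattern recognition"}))
```

With only "Pattern recognition" assigned, the tally reports one
explained assignment and two over-assignments. The studies above
recorded only the first of these figures.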

In later experiments, O'Connor tried several methods for
obtaining manually assigned index terms from full document
text (1961, 1962, 1965). He chose two index terms,
'toxicity' and 'penicillin', from the thesaurus of an
operational 10,000-document system. He then tried to
formulate rules for assigning the documents to the
appropriate subject heading without assigning other documents
in the collection to that subject heading. In the end, a
quite complicated indexing rule was formulated for each
thesaurus term. These rules, while assigning 'toxicity' to
most of the toxicity papers, also assigned 'toxicity' to
non-toxicity papers. To counteract the over-assignment without
causing concomitant under-assignment, O'Connor used minimum
frequency requirements, location of the toxicity clue in
specific parts of the document, etc.
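
A rule of this general shape might look like the following
Python sketch. It is not O'Connor's actual rule, which the text
describes only in outline: the clue vocabulary, the threshold of
three occurrences, the choice of "title" and "summary" as
privileged locations and the function names are all hypothetical.

```python
import re


def count_clues(text, clue_words):
    """Number of tokens in text that belong to the clue vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in tokens if w in clue_words)


def assign_term(document, clue_words, min_count=3,
                key_sections=("title", "summary")):
    """Decide whether to assign the term to one document.

    document maps section names to text. The term is assigned if a
    clue occurs in a key section (location requirement), or if clues
    occur at least min_count times overall (frequency requirement).
    """
    if any(count_clues(document.get(s, ""), clue_words)
           for s in key_sections):
        return True
    total = sum(count_clues(text, clue_words)
                for text in document.values())
    return total >= min_count
```

Raising `min_count` or narrowing `key_sections` trades
over-assignment against under-assignment, which is the balance
the rules described above had to strike.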

The rules formulated on an initial group of toxicity
papers were then tested on a second group of papers from the
same system. They correctly selected 92% of the toxicity
papers, at the cost of over-assigning 'toxicity' to 18% of
the non-toxicity papers. The computer-simulated rules were
comparable with the system's regular human indexers who
correctly assigned about 80% of the toxicity papers with a 2%
over-assignment.

A similar procedure was followed for the term
'penicillin', resulting in another set of simulated
computer-assignment rules. These rules correctly selected 97%
of the penicillin papers at the cost of a 4% over-assignment.
In contrast, indexers correctly assigned about 75% of the
papers and over-assigned less than 2%.
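
The two figures quoted for each rule can be computed as in the
Python sketch below. It assumes that over-assignment is measured
as a percentage of the papers that should not receive the term,
which is how the toxicity result is worded; the function name and
the toy collection in the usage lines are hypothetical.

```python
def assignment_rates(predicted, relevant, collection):
    """Return (percent correctly selected, percent over-assigned).

    predicted:  papers the rule assigns the term to
    relevant:   papers that should carry the term (the standard)
    collection: every paper in the test group
    """
    predicted = set(predicted)
    relevant = set(relevant)
    non_relevant = set(collection) - relevant
    selected = 100.0 * len(predicted & relevant) / len(relevant)
    over = 100.0 * len(predicted & non_relevant) / len(non_relevant)
    return selected, over


# Toy collection of 100 papers, 25 of which should carry the term.
papers = range(100)
should_have = range(25)
rule_output = set(range(23)) | {30, 31, 32}
print(assignment_rates(rule_output, should_have, papers))
```

Reporting both numbers together keeps a rule from looking good
merely because it assigns the term freely.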

Although the artificial indexing rules O'Connor devised
work quite well for 'penicillin' and 'toxicity', there are
some difficulties with his scheme. First, because each
thesaurus term requires a different rule, the invention,
programming and use of such rules for a real-life thesaurus
(say, 20,000 terms) is almost a practical impossibility.
Second, the two sample index terms selected for study were
both single words, and were posted on a rather high
proportion of the collection's documents (1500/10,000 for
toxicity and 700/10,000 for penicillin). Such heavy posting
is most unusual and occurs on fewer than 2 or 3 percent of
the terms even in very large collections (Houston and Wall
(1964)). Third, the study was done on biomedical literature
which typically has a well defined and very specific 
vocabulary. There is nothing comparable to O’Connor’s list 
of disorders in the vocabulary of engineering, and one might 
expect indexing rules to be different when the vocabulary is 
less precise. 

In conclusion, there are several major objections to most
of the studies we have discussed: the particular human
indexing chosen as a standard, the question of
over-assignment, and the investigation of only a few possible
artificial rules. In each of these studies, the human
indexing which acted as the standard was not all done by the
same person or group of people. This is an important point
because of the effect it has on the rules the experimenter
devises to account for the indexer's behavior. Let us
suppose that two indexers have rather different indexing
practices. One of them (Indexer One) assigns an index term
only if an exact match for a thesaurus phrase occurs one or
more times in the document. The other (Indexer Two) assigns
the term only if the exact match occurs two or more times.
Now suppose that Indexer One indexes X percent of the sample
documents, and that Indexer Two does the remaining Y percent.
The experimenter could come up with a rule which says "assign
the term if it occurs at least two times in the document".
This rule will omit up to X percent of the assignments. If
the experimenter decides the rule should be "assign the term
if it occurs one or more times in the document", then he will
be over-assigning in up to Y percent of the cases. If many
indexers and many indexer assignment rules are involved, the
hypothetical assignment rule devised by the experimenter is
very dependent upon the particular mix of people who did the
indexing.
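
The dependence on the indexer mix can be checked with a short
simulation. The Python sketch below is a worked illustration of
the argument, not part of the original experiment; the 60/40
split and the worst case in which every document contains the
phrase exactly once are assumptions chosen for the example.

```python
def indexer_one(count):
    """Assigns the term if the phrase occurs one or more times."""
    return count >= 1


def indexer_two(count):
    """Assigns the term only if the phrase occurs two or more times."""
    return count >= 2


def compare_rule(docs, threshold):
    """Compare 'assign if count >= threshold' with the human indexing.

    docs is a list of (indexer, occurrence_count) pairs. Returns
    (missed, over_assigned) as percentages of all documents.
    """
    missed = over = 0
    for indexer, count in docs:
        human = indexer(count)
        rule = count >= threshold
        if human and not rule:
            missed += 1
        elif rule and not human:
            over += 1
    n = len(docs)
    return 100.0 * missed / n, 100.0 * over / n


# Worst case: the phrase occurs exactly once in every document.
docs = [(indexer_one, 1)] * 60 + [(indexer_two, 1)] * 40
print(compare_rule(docs, threshold=2))   # misses only Indexer One's share
print(compare_rule(docs, threshold=1))   # over-assigns only on Indexer Two's share
```

Neither threshold reproduces the pooled indexing, and the size of
each error is set by how the documents happened to be divided
between the two indexers.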



There are two ways to deal with this problem. First, all
the documents could be indexed by the same person. The
experimenter would then be looking for a rule to explain the
behavior of a single indexer. The second possibility is to
have all the documents indexed by each of a group of people.
This leads the experimenter to an explanation of an "average"
type of indexing. Since an individual indexer is unlikely to
be following any rule consistently, the averaging would give
an opportunity for individual variations to cancel out.

The second major objection to most of the studies
discussed above is that they have ignored or played down the
effects of over-assignment. The artificial rule must account
for the non-assignment of terms as well as the assignment of
terms. This difficulty was discussed above in connection
with the Fels and Jacobs and the Montgomery and Swanson
studies.

Third, there has been no systematic investigation of a
broad spectrum of possible hypothetical indexing rules. As
Section 2.4 demonstrates, a combination rule (clue one AND
clue two) is very seldom employed. Investigation of a
broader range of rules would make it possible to say just how
complex an artificial rule must be to imitate human indexing.

Despite the assurances of an occasional devotee (Salton 
(1970)), there is no clear evidence that human indexing is
"machine-like". The models proposed in this thesis are
intended to investigate two general types of machine-like
rules to determine whether they do account for a large 
percentage of human indexing behavior. 

2.3 Rules for Automatic Indexing

Hypothetical indexing rules have been suggested for
purposes other than imitating human indexing; much of the
literature on automatic or mechanical indexing consists of
tests of such hypothetical rules. Instead of surveying the
literature of automatic indexing, which has been reviewed
exhaustively and competently by Stevens (1970), we will try
to summarize the types of rules proposed for automatic
indexers.



The automatic indexing rules mentioned in the literature
break down naturally into four general areas: 1) syntactic
clues, 2) statistical clues, 3) textual clues and 4)
assignment rules. In this section we will characterize the
three types of clues and cite examples of each type. Section
2.4 discusses the assignment rules. We are primarily
concerned with the textual clues and the assignment rules
because they provide a basis for understanding the models
used in Chapters 3 and 4.



2.3.1 Syntactic Clues



Syntactic analysis makes a first step toward
understanding the meaning of text by unravelling the text's
grammatical structure. Syntactic clues are chosen on the
basis of knowledge of this grammatical structure. An
automatic analyzer finds the part of speech of each word in
the text as it parses the sentence. Unfortunately, this is
not a simple process. Syntactic analyzers are often quite
complicated programs which can produce a number of alternate
readings of a single sentence. Dealing with two sentences is
beyond the abilities of most existing programs unless the
vocabulary and grammatical structures are severely limited.
Although Harris (1959) talked of kernelization of sentences
and replacement of pronouns in 1959, only recently have there
been programs which can actually perform some of these feats
(Shapiro, et al. (1969)). In fact, artificial intelligence
experimenters count the understanding of small portions of
text about calculus a major success (Simmons (1970, 21)) mainly
because of syntactic problems.

There have been automatic indexing experiments with
syntactic analyzers designed to search for specific types of
syntactic clues, however. Baxendale (1958, 1962), Baxendale
and Clarke (1966) and Clarke and Wall (1965) identified noun
phrases in natural language text with an accuracy of 91%.
Unfortunately, this program has never become part of an
automatic indexer. Klingbiel (1969, 1971) designed a program
to read in natural language text, locate phrases which could
serve as potential index terms, and display these phrases to
a human indexer. The human was expected to make the final
indexing decision. This analyzer recognized just thirteen
syntactic types.

While syntactic information will no doubt be an important
automatic indexing technique in future years, for the present
it is more talked about than practiced. This clue type is
not included in either of the models in this thesis.

2.3.2 Statistical Clues

The statistical methods of isolating clues are really 
methods for locating content-bearing words in natural 
language text. Large quantities of text must be processed -
usually by truncation and counting - to give statistical
information about the frequency of occurrence of text words
in the language as a whole. The object is to locate words
which have atypical distributions in the text. 

For instance, Dennis (1965, 1967), in one of the earliest
statistical experiments dealing with text, tested a number of 
statistical distributions intended to separate content- 
bearing words from the other words. About 3.8 million words 
from 2600 reports of law cases were keypunched. Then a 
number of statistical distributions were tested against this 
text to find one which characterized the content-bearing 
words. The content-bearing words identified by the
distribution became the master indexing list. Every time one 
of these words appeared in a document, the document was 
assigned that word as an index term. 

Damerau (1965) performed similar experiments with one
million words of world politics news broadcasts. The object
was the same: to find a statistical distribution which would
accurately separate content-bearing words. He found that
non-content-bearing words (often called "function" words) had
a Poisson distribution through the documents since these
words tended to be randomly distributed. Later, Stone (1967)
and Stone and Rubinoff (1968) tried several modified Poisson
distributions on a 70,000-word sample taken from Computing
Reviews. Stone found that words with a Poisson distribution,
since they occur randomly, are non-specialty or uninformative
words. Specialty words have non-random, non-Poisson
distributions. Stone developed two Poisson formulas and
proved that one of them is analogous to Dennis' best
separating formula.
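
The contrast between randomly scattered function words and
clumped specialty words can be illustrated with a simple
dispersion check. The sketch below is not the formula of
Dennis, Damerau, or Stone; it assumes, purely for
illustration, the variance-to-mean ratio of a word's
per-document counts as a stand-in, since a Poisson-distributed
count has a ratio near 1. The function name and the sample
counts are hypothetical.

```python
from statistics import mean, pvariance


def dispersion_index(counts_per_document):
    """Variance-to-mean ratio of a word's per-document counts.

    A Poisson-distributed word has a ratio near 1; a ratio well
    above 1 means the word clumps in a few documents.
    """
    average = mean(counts_per_document)
    if average == 0:
        return 0.0  # the word never occurs, so there is nothing to measure
    return pvariance(counts_per_document) / average


# Hypothetical counts of two words across ten documents.
function_word = [3, 2, 4, 3, 1, 5, 3, 2, 4, 3]      # spread fairly evenly
specialty_word = [0, 0, 0, 12, 0, 0, 0, 0, 9, 0]    # concentrated in two documents

function_ratio = dispersion_index(function_word)
specialty_ratio = dispersion_index(specialty_word)
```

With these made-up counts the evenly spread word has a ratio
below 1 and the clumped word a ratio far above 1, which is the
kind of separation the statistical methods look for.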

The identification of content-bearing words is a first
step in the compilation of a list of keywords, and a list of
keywords can be very useful when building a thesaurus. But
such a list does not, in itself, act as an automatic indexer.
For this reason, statistical methods of isolating clue words
are not included in either thesis model.

2.3.3 Textual Clues 

Textual clues (also called 'machine-recognizable textual
clues' or, simply, 'clues' in this thesis) are the most common
raw material for automatic indexing algorithms. Textual
clues are words or phrases produced by natural language text
or obtained from it without benefit of syntactic analysis or
statistical manipulations of large quantities of text. Since
this is a definition-by-default, some examples might be
helpful.

Many years ago Luhn (1957) suggested the use of location
as a textual clue. Words occurring in the title were
supposed to be more likely to be good descriptors than words
occurring in the body of the document. Other suggestions
have been made for locations of textual clues. Baxendale
(1958) thought the first and last sentences in each paragraph
were good. O’Connor (1965) tried the first and last
paragraphs of a document. Figure 2.01 lists the various
locations or combinations of locations tried by various
experimenters and references the journal article in which
each suggestion was made.



A second group of textual clues centers around a match 
between the text of the document and a word list of some 
sort. By far the most common type of match sought is an 
exact match between the document and a word list or thesaurus
(see Figure 2.01). Fangmeyer and Lustig (1969) and
Montgomery and Swanson (1962) accepted a partial match
between the document and the word list. Other experimenters
searched for stems of words, or utilized thesaurus
cross-references as clues.

The last major group of textual clues is based on
counting. Here, a count of the number of times a word is
used in a document determines whether that word is a clue or
not. Some experimenters (see Figure 2.01 again) simply take
the most frequently used words. Others take words occurring
at least X times in a document, or those which constitute at
least X% of the document. This counting procedure is to be
contrasted with the procedures used to obtain statistical
clues. Statistical clues are only available from large
quantities of text (on the order of a million words). The
counting procedure discussed above operates only on the
document at hand. It does not depend on statistical word
distributions in the language as a whole.
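
The three counting criteria just described (most frequent
words, at least X occurrences, at least X% of the document)
can be sketched in a few lines. This is a minimal
illustration, not the procedure of any of the cited
experimenters; the function name, its parameters, and the
sample sentence are assumptions, and no function-word
filtering is attempted.

```python
from collections import Counter


def counting_clues(document_words, min_count=None, min_fraction=None, top_k=None):
    """Select clue words from a single document by frequency alone."""
    counts = Counter(word.lower() for word in document_words)
    total = sum(counts.values())
    if top_k is not None:
        # "Most frequently used words" criterion.
        return {word for word, _ in counts.most_common(top_k)}
    selected = set()
    for word, count in counts.items():
        # "At least X times in a document" criterion.
        if min_count is not None and count < min_count:
            continue
        # "At least X% of the document" criterion.
        if min_fraction is not None and count / total < min_fraction:
            continue
        selected.add(word)
    return selected


words = "radiation counters measure radiation and radiation doses".split()
at_least_twice = counting_clues(words, min_count=2)
at_least_40_percent = counting_clues(words, min_fraction=0.4)
most_frequent = counting_clues(words, top_k=1)
```

Note that only the document at hand is consulted; no
language-wide frequency table is needed.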

Each of the methods in these three major groups of
textual clues is a way to obtain information about the
subject content of the document from its text. The two other
methods discussed in Sections 2.3.1 and 2.3.2 for obtaining
information about subject content (syntactic clues and
statistical clues) require either rather complicated
programming or large quantities of text. The textual clues
mentioned here are by far the most numerous clue types found
in automatic indexing experiments - probably because they are
the easiest clues to obtain with present-day computers. For
this reason, they are the clues modelled in this thesis.

2.4 Assignment Rules

An automatic indexing algorithm is a combination of two 
elements: the clues identified, and the assignment rules. 

Given a particular pattern of clues, the assignment rule
decides whether those clues result in an index term. For
example, suppose the clue-finding procedure looks for
thesaurus words in the document in two places: the abstract
and the title. An assignment rule might be the following:
"assign the index term if the thesaurus word occurs once in
the title or at least three times in the document". The
assignment rule keeps track of the locations, frequencies and
types of clues appearing in the document. When the minimum
assignment rule conditions for a particular thesaurus term
are met, that term is added to the document's index set. The
assignment rule is simply an indexing procedure operating on
textual information about the documents.
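
An assignment rule of this kind can be written as a small
predicate over clue counts. The sketch below assumes a
hypothetical rule (assign the term if the thesaurus word
occurs at least once in the title or at least three times in
the document as a whole); the function names and the sample
counts are illustrative only.

```python
def assign_term(title_count, document_count):
    """Hypothetical rule: once in the title, or three times anywhere."""
    return title_count >= 1 or document_count >= 3


def index_set(clue_counts):
    """Apply the rule to every thesaurus term.

    clue_counts maps a thesaurus term to a pair
    (occurrences in title, occurrences in whole document).
    """
    return {
        term
        for term, (title_count, document_count) in clue_counts.items()
        if assign_term(title_count, document_count)
    }


# Made-up clue counts for one document.
clue_counts = {
    "radiation counters": (1, 1),   # appears in the title
    "dosimeters": (0, 3),           # frequent in the abstract only
    "ionization chambers": (0, 1),  # a single passing mention
}
assigned = index_set(clue_counts)
```

The clue-finding step and the rule are deliberately separate:
the same counts could be fed to a different rule without
rescanning the text.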



Many studies have made use of very primitive assignment
rules. The most common of these is: if any textual clue
occurs, then assign the corresponding index term (see Figure
2.02). In some cases, several textual clue types are
involved. For instance, Artandi (1969) looked for two
significant words in the same sentence. Montgomery and
Swanson (1962) searched for at least one of several clue
types. Luhn (1957) searched for words in particular
locations with high frequencies. O’Connor (1965) developed
increasingly more complicated assignment rules for two index
terms in the medical field. In fact, his assignment rules
were different for each index term studied.

In conclusion, we have seen that a number of hypothetical
indexing rules have been proposed and tested in the pursuit
of automatic indexing algorithms. Unfortunately for us, the
results of these automatic indexing experiments are sometimes
not evaluated at all, are evaluated only in terms of the
total number of terms in the index set, or are compared with
the output of a single human indexer. Although none of the
experimental results are particularly useful to us in
deciding what proportion of human indexing can be accounted
for by textual clues, these studies do give us valuable
insight into hypothetical rules which could be used to
imitate human indexing.

2.5 Clues and Assignment Rules Used in this Thesis

The clues and assignment rules modelled in this thesis
are extensions of those found in the literature (see Figures
2.01 and 2.02) with adaptations to accommodate the documents
actually used. For instance, since the sample documents
indexed are short and consist of just a title and abstract,
just two locations for the clues are distinguished: title
and abstract. On the other hand, extensive use is made of
information from the thesaurus for identifying clues.
Sections 2.5.1 and 2.5.2 describe and define the clues for
the regression and combinatorial models. Section 2.5.3
describes the assignment rules typified by the two models.

2.5.1 Regression Model Clue Types 

In keeping with the breakdown found in the literature,
clues have been divided into three general groupings:

1 type of match (6 different types in group)

2 length of match (5 different lengths in group)

3 location of match (2 different locations in group).

One element is taken from each of the three groupings to
constitute a single clue. For example, a main entry
descriptor match (group one) of a three-word phrase (group
two) in the abstract (group three) is a single clue. There
are 6*5*2 or 60 possible clue types.

The three short lists below constitute a complete display
of each of the items in the groups. All possible clues are
formed by taking every possible combination of matches from
the three groups.

    Type             Length               Location

    main entry       three-word phrase    title
    stem             two-word phrase      abstract
    used-for term    header
    broader term     modifier 2
    narrower term    modifier 1
    related term
                                                     (2.01)



These sixty clue types may be thought of as a sixty-place
string of numbers. The position of the number in the string
indicates the clue type; the value of the number itself is
the frequency of occurrence of that clue type. For example,
the first number in the string of numbers is the position for
three-word main entry descriptor matches in the title. If a
'2' occurs in this location for a given document, there are
two three-word main entry phrase matches for the thesaurus
term in the title of the document. We call this sixty-place
string of numbers a "clue vector". There is a clue vector
for each document-term pair analyzed. These clue vectors
form the basis of the multiple regression model discussed in
Chapter 3.
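
The sixty-place clue vector can be pictured as a list indexed
by (type, length, location). The sketch below is an
illustration only: the text fixes just the first position
(three-word main entry match in the title), so the ordering
of the remaining fifty-nine positions, like the names
`CLUE_TYPES` and `record_clue`, is an assumption.

```python
from itertools import product

MATCH_TYPES = ["main entry", "stem", "used-for", "broader", "narrower", "related"]
LENGTHS = ["three-word phrase", "two-word phrase", "header", "modifier 2", "modifier 1"]
LOCATIONS = ["title", "abstract"]

# Assumed ordering: match type varies slowest, location fastest.
CLUE_TYPES = list(product(MATCH_TYPES, LENGTHS, LOCATIONS))
CLUE_INDEX = {clue: position for position, clue in enumerate(CLUE_TYPES)}


def empty_clue_vector():
    """One zero per clue type, for a single document-term pair."""
    return [0] * len(CLUE_TYPES)


def record_clue(vector, match_type, length, location):
    """Add one occurrence of the given clue type to the vector."""
    vector[CLUE_INDEX[(match_type, length, location)]] += 1


vector = empty_clue_vector()
record_clue(vector, "main entry", "three-word phrase", "title")
record_clue(vector, "main entry", "three-word phrase", "title")
record_clue(vector, "main entry", "header", "abstract")
```

After these three calls the first position holds a 2, matching
the example in the text of two three-word main entry matches
in the title.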



Each of the matches is operationally defined by the
computer programs used to isolate it. A definition of what
constitutes a match between the document and the thesaurus
phrase is given below. Information on the computer programs
may be found in Chapter 5.

To understand what is meant by each component of a clue
type, consider the following excerpt from a thesaurus:

    Radiation counters
        BT  Measuring instruments
            Radiation measuring instruments
        NT  Beta spectrometers
        RT  Dosimeters
            Ionization chambers
    Vertical takeoff aircraft
        UF  convertiplanes

where BT = broader term, NT = narrower term, RT = related
term, and UF = used for.

Main entry match: the thesaurus and the document word(s)
match exactly, character for character. A singular/plural
difference is counted as an exact match. Thus 'counters' in
the thesaurus matches 'counter' or 'counters' exactly.

Stem match: the stem of the thesaurus word and the stem
of the document word(s) match exactly. The stem of a word is
that part of a word to which inflectional endings are added
or in which phonetic changes are made for inflection. The
thesaurus stem 'radia' matches the document stem 'radia' for
such unstemmed words as 'radiation', 'radiate', etc.

Used-for match: the UF references in the thesaurus match
either the singular or the plural form of the word(s) in the
document. A used-for match is counted for the thesaurus term
'vertical takeoff aircraft' if either 'convertiplanes' or
'convertiplane' occurs in the document.

Broader term match: the BT references in the thesaurus
match either the singular or the plural form of the word(s)
in the document. A broader term match is counted for the
thesaurus term 'radiation counters' if 'measuring
instruments' or 'measuring instrument' or 'radiation
measuring instruments' or 'radiation measuring instrument'
occurs in the document.

Narrower term match: the NT references in the thesaurus
match either the singular or the plural form of the word(s)
in the document.

Related term match: the RT references in the thesaurus
match either the singular or the plural form of the word(s)
in the document.

Three-word phrase match: if the thesaurus term being
tested is a three-word phrase, and the words occur in the
document with no more than one intermediate 'of', then a
three-word phrase match has occurred. A three-word phrase
match for 'vertical takeoff aircraft' occurs if either
'vertical takeoff aircraft' or 'takeoff of vertical aircraft'
or 'vertical takeoff of aircraft' or 'aircraft vertical of
takeoff', etc. occur in the document.

Two-word phrase match: if the thesaurus term being
tested is a two-word phrase, and the words occur in the
document with no more than one intermediate 'of', then a
two-word phrase match has occurred.

Header match: if the right-most word of a multi-word
thesaurus phrase occurs in the document, or if a thesaurus
entry of a single word occurs in the document, then a header
match is counted. If either 'dosimeters' (a thesaurus entry
of a single word) or 'chambers' (the right-most word of
'Ionization chambers') occurs in the document, a header match
is counted. If a thesaurus term of the form 'card punches
(data processing)' occurs, the parenthesized expression is
ignored. In this case 'punches' is the right-most word of a
two-word phrase and is therefore the header.

Modifier 2 match: if the second word of a three-word
thesaurus phrase, or the left-most word of a two-word
thesaurus phrase occurs in the document, then a modifier 2
match is counted. A modifier 2 match for 'vertical takeoff
aircraft' is counted if 'takeoff' occurs in the document; a
modifier 2 match for 'radiation counters' is counted if
'radiation' occurs in the document.

Modifier 1 match: if the first word of a three-word
thesaurus phrase occurs in the document, then a modifier 1
match is counted. The word 'vertical' is a modifier 1 match
for 'vertical takeoff aircraft'.

Title match: if the word(s) being matched occur in the
title, then a title match has occurred.

Abstract match: if the word(s) being matched occur in
the abstract, then an abstract match has occurred.
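
The header and modifier definitions above can be sketched as
follows. This is an illustrative reading of the definitions,
not the thesis programs described in Chapter 5; the helper
names are hypothetical, the singular/plural test is a crude
final-'s' rule, and the 'of'-tolerant phrase matching is not
attempted.

```python
import re


def term_parts(thesaurus_term):
    """Split a thesaurus term into its header and modifier words."""
    # Ignore a parenthesized qualifier such as "(data processing)".
    cleaned = re.sub(r"\([^)]*\)", "", thesaurus_term).lower()
    words = cleaned.split()
    parts = {"header": words[-1]}          # right-most (or only) word
    if len(words) == 2:
        parts["modifier 2"] = words[0]     # left-most word of a two-word phrase
    elif len(words) == 3:
        parts["modifier 1"] = words[0]     # first word of a three-word phrase
        parts["modifier 2"] = words[1]     # second word of a three-word phrase
    return parts


def singular_or_plural(word):
    """Crude variants of a word: with and without a final 's'."""
    return {word, word[:-1]} if word.endswith("s") else {word, word + "s"}


def count_matches(word, document_words):
    """Count document words equal to the singular or plural form of word."""
    variants = singular_or_plural(word)
    return sum(1 for w in document_words if w.lower() in variants)


punches = term_parts("card punches (data processing)")
aircraft = term_parts("Vertical takeoff aircraft")
hits = count_matches("counters", "the counter and the counters".split())
```

Each occurrence is counted separately, so a word that appears
twice in the abstract contributes two matches of its clue
type.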



The textual clues occurring in the document may be
counted more than once. If both 'vertical takeoff aircraft'
and 'aircraft' occur in the abstract, this counts as one
exact three-word phrase match in the abstract and two exact
header matches in the abstract. This method of counting
assures that each clue is counted independently of all
others.

2.5.2 Combinatorial Model Clue Types 

The regression model in Chapter 3 and the combinatorial
model in Chapter 4 have been tested with the same clue types.
However, the additive properties of the regression model and
the Boolean properties of the combinatorial model require
somewhat different reporting schemes for these clues. The
regression model simply records the count of the number of
times a clue appears in the document. The combinatorial
model uses Boolean combinations, so the numbers in the clue
vector must be binary (either one or zero). This is
accomplished by translating the single-cell count of the
regression model into a binary record. There is a zero in
the binary record if there is a zero in the corresponding
place in the regression model record. There is a one in the
binary record if there is a number greater than zero in the
corresponding position of the regression model. The binary
vector simply records which clue types are present in the
document. A zero value in a binary clue cell means the clue
type did not occur in the document; a one means that one or
more clues of that type occurred in the document.
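
The translation from the regression model's count record to
the combinatorial model's binary record is a one-line
operation. The sketch below uses a shortened six-place vector
with made-up counts; the function name is hypothetical.

```python
def to_binary(count_vector):
    """Replace every positive count with 1 and leave zeros as 0."""
    return [1 if count > 0 else 0 for count in count_vector]


counts = [2, 0, 0, 1, 0, 3]   # regression-model record (occurrence counts)
binary = to_binary(counts)    # combinatorial-model record (presence only)
```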

This particular pattern for the binary clue vector was
chosen for two practical reasons: 1) for the size of
documents used in the sample, there is little necessity to
record broad frequency ranges since high frequency clues are
not common and 2) additional clue types increase
computational time considerably. In theory, there is no limit
to the occurrence frequencies which could be represented by a
binary record, however. As with the regression model, there
is one clue vector for each document-term pair analyzed.

The following three short lists summarize the clues used
for the combinatorial model.

    Type             Length               Location

    main entry       three-word phrase    title
    stem             two-word phrase      abstract
    used-for term    header
    broader term     modifier 2
    narrower term    modifier 1
    related term
                                                     (2.02)



These lists are identical to those in Equation 2.01 except
for the modification of the options in the location
group.

As with the regression model, all possible clues are
formed by taking every possible combination of matches from
the three groups. There are a total of 6*5*2 or 60 possible
clues.







2.5.3 Assignment Rules Used in the Models



The models in Chapters 3 and 4 are intended to test a 
number of possible assignment rules in a systematic fashion. 
Each model tests a different class of assignment rules
although in a certain number of special cases the two kinds 
of assignment rules are mathematically equivalent. 

The class of assignment rules tested by the combinatorial 
model are a particular set of Boolean equations formed from 
combinations of the sixty binary clues. These Boolean 
equations are of the form (clue-type-1 AND clue-type-2) OR
(clue-type-3) OR (clue-type-4 AND clue-type-5). Translated
into a model of human indexing, the above equation would
read: if clue-type-1 AND clue-type-2 OR if clue-type-3 OR if
clue-type-4 AND clue-type-5 are present in the document, then 
assign the thesaurus term. These equations are covered in 
more detail in Chapter 4. 
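
A Boolean assignment rule of this form can be evaluated
directly against a binary clue vector. The sketch below
represents the rule as an OR of AND-groups; this
representation, the function name, and the five-place example
vectors are assumptions made for illustration, and are not
the search procedure of Chapter 4.

```python
def rule_fires(rule, binary_vector):
    """Evaluate an OR of AND-groups against a binary clue vector.

    rule is a list of groups; each group lists the (zero-based)
    positions of clue types that must all be present.
    """
    return any(
        all(binary_vector[position] == 1 for position in group)
        for group in rule
    )


# (clue-type-1 AND clue-type-2) OR (clue-type-3) OR (clue-type-4 AND clue-type-5)
rule = [[0, 1], [2], [3, 4]]

first_group_met = rule_fires(rule, [1, 1, 0, 0, 0])
single_clue_met = rule_fires(rule, [0, 0, 1, 0, 0])
nothing_met = rule_fires(rule, [1, 0, 0, 1, 0])
```

When the rule fires, the thesaurus term is assigned; the last
example shows that partial evidence from two different groups
does not add up, which is exactly where a Boolean rule differs
from an additive one.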

The class of assignment rules tested by the multiple
linear regression model is of a different form:

    Y = A + B_1 (number of clue-type-1 occurrences)
          + B_2 (number of clue-type-2 occurrences)
          + ... + B_n (number of clue-type-n occurrences).

Translated into a model of human indexing, this equation
would read: to a constant, A, add the coefficient B_1
multiplied by the number of times clue-type-1 occurred; then
add the coefficient B_2 multiplied by the number of times
clue-type-2 occurred; etc. The sum Y is the percentage of
indexers assigning the term. The multiple linear regression
model looks for additive combinations of the textual clues.
Each clue is weighted arithmetically by the coefficients so
the total score for a particular term is a sum of the
fractions of all of the clues considered. (See Section 3.2
for a detailed discussion of this weighting.) The object of
the regression calculations is to find the "best" values for
the constant and coefficients. Chapter 3 discusses the
regression in more detail.
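
The additive rule can be written out for a toy case. The
constant and coefficients below are invented for
illustration and are not fitted values from this study; the
function name is hypothetical.

```python
def predicted_share(constant, coefficients, clue_vector):
    """Y = A + B_1*X_1 + ... + B_n*X_n for one document-term pair."""
    return constant + sum(b * x for b, x in zip(coefficients, clue_vector))


# Made-up values: A = 0.05, three clue types with weights 0.30, 0.10, 0.20.
score = predicted_share(0.05, [0.30, 0.10, 0.20], [1, 0, 2])
```

Here one occurrence of the first clue type and two of the
third give 0.05 + 0.30 + 0.40 = 0.75, read as three quarters
of the indexers assigning the term.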

As mentioned above, in a certain number of special cases,
the Boolean combinatorial model and the multiple linear
regression model are equivalent. A branch of switching
theory, called "threshold logic", deals with this
equivalency. Threshold logic (Lewis and Coates (1967)) is
concerned with converting binary circuits (or Boolean
equations) into threshold circuits (or a sequence of linear
equations). A number of methods are available for
"realizing" (converting from Boolean to) a threshold logic
element. All Boolean equations can be realized by one or
more threshold logic elements. However, only a few Boolean
equations may be converted to a single linear equation. When
just a single threshold element is needed, the Boolean
equation is said to be "linearly separable". If there are
two Boolean variables (in our case a Boolean variable is a
clue type), then there are 16 distinct Boolean functions of
which 14 are linearly separable. If there are three Boolean
variables, then there are 256 distinct functions of which 104
are linearly separable. When the number of Boolean
variables is equal to or greater than 4, the percentage of
linearly separable functions decreases rapidly (Torng (1966)
20). The equivalency between the best Boolean and the best
regression models is discussed in Section 6.2.3.
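
The counts of linearly separable functions quoted above (14
of 16 for two variables, 104 of 256 for three) can be checked
by brute force. The sketch below enumerates small integer
weights and thresholds and collects the distinct truth tables
a single threshold element can produce. The weight bound of 3
is an assumption that is adequate for two and three variables
only, and the function name is hypothetical.

```python
from itertools import product


def count_threshold_functions(n, max_weight=3):
    """Count Boolean functions of n variables realizable by one threshold element."""
    inputs = list(product([0, 1], repeat=n))
    weight_range = range(-max_weight, max_weight + 1)
    bound = n * max_weight  # largest possible magnitude of a weighted sum
    truth_tables = set()
    for weights in product(weight_range, repeat=n):
        for threshold in range(-bound, bound + 2):
            # Output is 1 when the weighted sum reaches the threshold.
            table = tuple(
                int(sum(w * x for w, x in zip(weights, xs)) >= threshold)
                for xs in inputs
            )
            truth_tables.add(table)
    return len(truth_tables)


separable_2 = count_threshold_functions(2)
separable_3 = count_threshold_functions(3)
```

For two variables the two functions left out are exclusive-OR
and its complement, neither of which any single weighted sum
can reproduce.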

Since the regression model does not permit testing of 
many Boolean selection rules because of the low density of 
linearly separable functions, a Boolean combinatorial model
is also desirable. In this thesis, one particular group of 
Boolean assignment rules is tested exhaustively to uncover 
the best set of Boolean equations for the sample documents. 

Both models assume that the same indexing procedure or 
assignment rule applies to all terms in the thesaurus. This 
is consistent with the approach taken by all automatic 
indexing studies with the exception of O’Connor who devised a 
different rule for each thesaurus term. Both models are, of 
course, dependent upon the particular clue types chosen by 
the experimenter. Neither model can disclose the importance 
of clue types not included in the model.

2.6 Subject Experts versus Librarians

If indexers do little more than pick good words out of 
the document, then a high level of subject competence may not 
be necessary. On the other hand, if indexers make
intellectual decisions requiring knowledge about technical
subjects, potential users of the system, etc., then subject
expertise is an obvious prerequisite.

Although comparative studies have been made of author- 
indexers versus professional indexers, no comparison has been 
made of the dependence of the two groups on the textual clues 
in the document. One would expect that scientist-indexers 
would depend less on the actual words used in the documents 
because of their greater understanding of the subject matter. 
Librarian-indexers would not have the benefit of subject
familiarity and would, therefore, be more dependent upon the 
words actually used in the document when indexing. 

To test this hypothesis, two groups of indexers have been
used as subjects for this study. The first group consisted
of six librarian-indexers. Each of the librarians had an
M.L.S. degree from an accredited school. Each had spent some
time either indexing or cataloging in a special library in
the field of engineering or science. Each had worked on a
reference desk answering questions from patrons of the same
kind of library. Each was familiar with the standard
scientific and engineering abstracting journals.

The second group consisted of six scientist-indexers.
Each of these scientists or engineers had at least an
undergraduate degree in engineering or the hard sciences. In
some cases, the scientist had an M.S. or a PhD. Each was
earning a living as a scientist or engineer at the time of
the study. The documents used for the experiment were in the
field of instrumentation. This topic was chosen because
scientists and engineers familiar with that subject were
available to do the indexing.






Figure 2.01 Table of Textual Clues

A. Researchers using location as a clue:

    title, abstract, headings, text, references, figures
        Edmundson and Wyllys (1961)
    first and last paragraphs
        Luhn (1957) 315
        O’Connor (1965) 499
    title and first paragraph
        Swanson (1963)
    title, abstract, full text
        Luhn (1959)
    first and last sentences in paragraph
        Baxendale (1958)

B. Researchers using type of match as a clue:

    thesaurus or word list matches
        Artandi (1964, 1969)
        Bloomfield (1966)
        Fangmeyer and Lustig (1969)
        Harris (1959)
        Luhn (1959)
        Meyer-Uhlenried and Lustig (1963)
        Montgomery and Swanson (1962)
        O’Connor (1965)
        Salton (1968) 26
        Slamecka and Zunde (1963)
        Swanson (1960)
        Zunde (1965)
    part of a thesaurus phrase
        Fangmeyer and Lustig (1969)
        Montgomery and Swanson (1962)
    cross-references from the thesaurus
        Fangmeyer and Lustig (1969)
    stem matches
        Fangmeyer and Lustig (1969)
        Luhn (1958)
        Salton (1968) 30-33
        Zunde (1965)
    multi-part clue expression with variable substitutions
        O’Connor (1965)

C. Researchers using count and frequency criteria as clues:

    absolute frequency counts
        Baxendale (1958)
        Jones, Giuliano and Curtice (1970)
        Luhn (1958)
    relative frequency counts
        Artandi (1969) 218
        O’Connor (1965) 499, 508
    most frequent words
        Luhn (1957, 1958)
    most frequent word pairs
        Baxendale (1958)
        Edmundson and Wyllys (1961)


Figure 2.02 Table of Assignment Rules

    a match with the thesaurus or with a word list
        Artandi (1969)
        Bloomfield (1966)
        Fangmeyer and Lustig (1969)
        Harris (1959)
        Jones, Giuliano and Curtice (1970)
        Montgomery and Swanson (1962)
        Salton (1968) 25-48
        Zunde (1965)
    most frequent words in first and last sentence
    of each paragraph
        Baxendale (1958)
    no more than X non-significant words separating
    significant words
        O’Connor (1965)
    two significant words in the same sentence
        Artandi (1969) 219
    two significant words within two paragraphs
        Luhn (1957)
    at least X occurrences of thesaurus words per
    Y words of text
        O’Connor (1965)
    title, heading, resume and frequency
        Luhn (1957)

Chapter Three

THE MULTIPLE LINEAR
REGRESSION MODEL

3. The Multiple Linear Regression Model



3.1 Desirable Characteristics of an Indexing Model



An ideal model of textually-clued indexing would have
several properties. First, it should answer the question
"How strong is the relationship between the clues in a
document and the index terms assigned to that document?"
The answer to this question would tell us just how much of
the indexing can be accounted for on the basis of the clues.





Secondly, it should be possible to make some
statistically valid statements about the entire population of 
indexers and documents with the information obtained from the 
single sample. We would like to be able to infer that the 
relationship found in the sample also holds for the 
population as a whole. 



Thirdly, the model should be able to be used
predictively. It should say whether a particular index term
would be assigned to an arbitrarily chosen document. This
prediction might not be just a yes/no decision, but could
also be, say, a prediction of the percentage of indexers who
would assign the term to the document. If it turned out that
there were only a small statistical relationship between the
clues and the indexing assignments, then this predictive
property would not be of much practical importance since the
model could not function in place of the real indexers. If,
however, there were a strong statistical relationship, a
predictive model could be substituted for the indexers.

Because of the capability of giving strong answers to
these requirements, multiple linear regression has been
chosen as our first indexing model. Since this model assumes
a linear relationship between the index terms assigned and
the clues, a second model has also been built. This model,
called the combinatorial model, does not assume linearity.
The multiple linear regression model will be discussed in
this chapter and the combinatorial model in the next.

3.2 The Mathematics of Regression



This section gives a cursory explanation of multiple
linear regression. Although many statistics texts treat the
subject, most discussions are difficult to read. The
following books may be consulted for more detailed
discussions: Hays ((1963) 490-577), Ferber ((1949) 346-379),
Ostle ((1963) 159-243), and Draper and Smith (1966).



Regression is a common statistical technique used to show
the linear relationships among two or more variables. For
instance, we would like to know whether the index terms
assigned to a document are related to the occurrence of
textual clues in the document. In this case, the dependent
variable is the percentage of indexers who assign a given
index term and the independent variables are the various
types of machine-recognizable textual clues in the document.

Assume for the moment that several indexers individually
choose index terms from a thesaurus for the same document.
In effect, the indexers are voting for the set of most
popular index terms from among the potential thesaurus
candidate terms. Some index terms will receive many votes,
others fewer; most will receive no votes at all. Each of the
potential thesaurus candidate terms considered by the indexer
group is a single experimental event. This experimental
event consists of the (n+1) numbers:

    1    number of times clue type 1 occurred in document,
    2    number of times clue type 2 occurred in document,
    .
    .
    n    number of times clue type n occurred in document,
    n+1  percentage of indexer group voting for term.

For example, let us suppose the document indexed has the word
'computers' in it twice and that the index term now being
considered is 'computers'. If clue type 7 is the exact match
between the index term and word in the document, then clue
type 7 occurs twice in this document; therefore the number in
the seventh place in the (n+1)-tuple is a 2. The numbers 1
through n form the clue vector discussed in Section 2.5.1.
The clue types used in the model are also listed in that
section.



Each experimental event is represented numerically by an
(n+1)-tuple where n is the number of known document
characteristics. In this case, n is the number of types of
machine-recognizable textual clues tested by the experiment.
The remaining point in the (n+1)-tuple is the dependent
variable or the percentage of indexers assigning that term to
the document.



As each of the potential thesaurus candidate terms is
considered in turn, a new (n+1)-tuple is produced to
represent the differing percentages of indexers who assign
the term and the different quantity of textual clues in the
document for that term. If all indexers index the same
documents with the same thesaurus, then there will be

    N = (documents indexed) * (size of thesaurus)      (3.01)

experimental events or (n+1)-tuples.
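
The construction of the experimental events can be sketched
as follows. The function name, the two-place clue vectors,
and the vote counts are made up for illustration; only the
shape of the data (n clue counts followed by the percentage
of indexers) comes from the description above.

```python
def experimental_events(clue_vectors, votes, indexer_count):
    """Build one (n+1)-tuple per document-term pair.

    clue_vectors maps (document, term) to a list of n clue counts;
    votes maps (document, term) to the number of indexers who
    assigned that term to that document.
    """
    events = []
    for pair, clues in clue_vectors.items():
        percentage = 100.0 * votes.get(pair, 0) / indexer_count
        events.append(tuple(clues) + (percentage,))
    return events


documents = ["doc1", "doc2"]
thesaurus = ["computers", "counters", "dosimeters"]

# Start every document-term pair with an all-zero clue vector (n = 2 here).
clue_vectors = {(d, t): [0, 0] for d in documents for t in thesaurus}
clue_vectors[("doc1", "computers")] = [2, 0]   # two matches of clue type 1
votes = {("doc1", "computers"): 9}             # 9 of 12 indexers chose the term

events = experimental_events(clue_vectors, votes, indexer_count=12)
```

Two documents and a three-term thesaurus give N = 2 * 3 = 6
events, most of them all zeros, in line with the remark that
most candidate terms receive no votes at all.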

Each of these experimental observations can be
represented in (n+1)-space as a single point. The object of
the multiple linear regression is to fit the best straight
line through these points. This line is fitted so that the
summed squared deviations of the points from the line are
minimized.

The equation of the resulting straight line is the
classic one:

    Y = A + B_1 X_1 + B_2 X_2 + ... + B_n X_n          (3.02)

where A is a constant, the X's are the n clue types, and Y is
the proportion of indexers assigning the term.

The B's can be thought of as weights for each clue type
in the regression equation. Equation 3.02 can be re-written
as:

    Y = A + B_1 * (clue type 1)
          + B_2 * (clue type 2)
          .
          .
          + B_n * (clue type n).

Here the B's weight the clue types so that the sum of each of
the terms in the equation totals to Y.

Notice the additive nature of the effects of the various
clue types. This model says that an indexing decision is
based on a weighted sum of all clue types, each clue type
adding its evidence to the total evidence available for that
index term. This assumption of linearity is basic to the
regression model. It allows us to find the single
best-fitting straight line for the data.
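
The least-squares fit described above can be sketched in a few lines of Python. This is a minimal illustration that solves the normal equations by Gauss-Jordan elimination; it is not the stepwise program used for the thesis calculations, and the function name and data layout are assumptions made for the example.

```python
def fit_linear(xs, ys):
    """Least-squares fit of Y = A + B_1 X_1 + ... + B_n X_n.

    xs: list of observations, each a list of n clue counts.
    ys: list of observed proportions, one per observation.
    Returns [A, B_1, ..., B_n].
    """
    # A leading 1 in every row carries the constant A.
    rows = [[1.0] + [float(v) for v in x] for x in xs]
    k = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("clue columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]
```

Minimizing the summed squared deviations is what leads to the normal equations, so the coefficients returned are the A and B values of Equation 3.02 for the supplied sample.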

The use of multiple linear regression requires two
assumptions about the data. These assumptions are not needed
to calculate the correlation coefficient, but are required to
say how good the correlation coefficient is as an estimator
of the true population coefficient and to set confidence
intervals. The first of these assumptions, normality, says
that for a given X value, the Y values are distributed
normally about a mean. When only a single X value is
involved, or the values of X's can be controlled by the
experimenter, the data can be inspected to see whether the
assumption of normality is justified. Since our model has
many X's whose values are not under experimental control, it
is very difficult to determine how the Y values are
distributed. It turns out, however, that deviations from
normality do not have a serious influence on the regression
model (Scheffe (1959) 350, 360-368). Regression is not very
sensitive to non-normality.

However, the upper and lower bounds set on the
correlation coefficient are very dependent upon
homoscedasticity. The homoscedasticity of a variable is the
degree to which its variance is constant; that is, the degree
to which the variance of Y given X is the same for all X.
Unequal variances play havoc with the setting of confidence
intervals. One way to deal with non-homoscedasticity is to
squeeze out the effect of unequal variances with
transformations of the X values. A number of transformations
can be made (Dixon (1970) 17-19).

One way to examine the data for unequal variance is to
plot the residuals of the regression for each independent
variable against the dependent variable. A residual is the
difference between the Y actually measured in the experiment
and the Y value calculated during the regression. The
calculated Y value is the appropriate point on the best-fit
line drawn by the regression through all the data points. If
the residuals for a given variable show a marked tendency to
scatter in a particular pattern, then transformations of the
data are probably required to assure homoscedasticity.



The regression program used for calculations in this thesis
(see Section 3.6) could produce the required residual plots
on demand. Examination of the plots of residuals for the
major independent variables showed no distinct tendency in
the scatter. Although there was a tendency for values to
cluster at the low end of the x-axis where the independent
variables (clue types) had values of 1 or 2, this effect was
primarily due to the sparseness of high-valued observations.
This was due to the fact that clues had a tendency to occur
once or twice, but seldom six or eight times in a single
document. Of course, this meant that more data was available
on the low end of the scale. The higher values seemed to be
randomly scattered throughout their ranges. For this reason,
transformations of the original data were not necessary to
preserve homoscedasticity.







3.3 The Correlation Coefficient 

The correlation coefficient, R, is a measure of the 
strength of the linear relationship between the index terms 
assigned and the textual clues in the documents. 

If the distributions of X and Y are similar, then R may
take on any value from an extreme low value of -1 to an
extreme high value of 1 (Hays (1963) 510). When the
distributions of X and Y are very dissimilar, these extremes
can shrink considerably (see Carroll (1961)). We would
expect our X and Y distributions to be very similar. Most of
the values of these two variables will be zero; a middling
number of observations will have low values (one indexer
assigns, or a clue occurs once in a document); fewer will
have mid-range values (several indexers assign the same term,
the same clue occurs several times); very few observations
will have high values (almost all indexers agree to assign, a
particular clue occurs many times in the document). An
inspection of Figures 5.07 and 5.09 bears out this
expectation. The indexers in Figure 5.07 have a tendency to
make unique assignments; terms assigned by many indexers
occur infrequently. The same distribution is evident in the
totals of Figure 5.09. A particular clue type is usually a
unique occurrence in a document. Since the distributions of
X and Y in our data are very similar, R has a -1 to +1 range.



A +1 value of R means that the X and Y variables are
perfectly positively correlated. In other words, Y varies in
the same way and in the same direction as X because the
possible values of X and Y lie on a straight line with a
positive slope. If R has a value of -1, then X and Y are
perfectly negatively correlated. This means that possible
values of X and Y lie on a straight line with a negative
slope. Between these two extremes, R can be zero. This
means that X and Y are uncorrelated or linearly unassociated
with each other. Two completely random phenomena exhibit a
correlation coefficient of zero.



A correlation of +1, however, does not mean that there is
a causal relationship between X and Y, nor does a correlation
of zero mean that X and Y are statistically independent. We
are simply observing that X and Y vary in a particular
fashion; we are not saying why this variation occurs.

It should be noted that it is always possible to make R 
equal to 1 by increasing the number of independent variables 
to equal the number of observations made. As long as the 
number of variables (clue types) remains low in comparison to 
the number of observations, there is no danger of forcing the 
value of R to one. Thus, our ratio of 61 clue types to 6379 
observations will not prejudice the value of R. 



Recall from Section 3.2 that we have been using summed
squared deviations as a measure of the best fit regression
line. Again using summed squared deviations, the total
variance exhibited by the data is equal to the summed squared
deviations of the actual Y's from the average Y. This
assumes that we merely averaged all the data. In fact,
however, we are positing a linear relationship between X and
Y, so the deviations we have not been able to explain by the
regression equation are the summed squared deviations of the
actual Y's from the Y's predicted by the regression equation.
The explained variance is then the summed square of the
difference between the Y's computed by the regression and the
average Y. If we divide the explained variance by the total
variance, then we have a measure of the amount of variance
accounted for by the regression, or a measure of the
"goodness" of the regression. This statistic is:

R^2 = explained variance / total variance (3.03)

and it is expressed as a percentage. In fact, it is the
percentage of variance accounted for by the regression. Note
that when R is either +1 or -1, R^2 is also one and that when
R is almost zero, R^2 is also almost zero.

For our purposes, then, R^2 is the percentage of indexing
accounted for by the regression. As far as the regression
model is concerned, it is the percentage of indexing behavior
which can be accounted for by the use of textual clues.
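
The ratio of explained to total variance can be computed directly from the actual and predicted Y values. The sketch below is only an illustration; the function names are assumptions, and the three data points are invented. The decomposition of total variance into explained and unexplained parts holds when the predictions come from a least-squares fit that includes the constant A.

```python
def total_variance(actual):
    """Summed squared deviations of the actual Y's from the average Y."""
    mean_y = sum(actual) / len(actual)
    return sum((y - mean_y) ** 2 for y in actual)


def explained_variance(actual, predicted):
    """Summed squared deviations of the predicted Y's from the average Y."""
    mean_y = sum(actual) / len(actual)
    return sum((p - mean_y) ** 2 for p in predicted)


def r_squared(actual, predicted):
    """Proportion of variance accounted for by the regression."""
    return explained_variance(actual, predicted) / total_variance(actual)


# Invented example: X = 0, 1, 2 and the least-squares line Y = 0.25 + 0.25 X.
actual = [0.0, 1.0, 0.5]
predicted = [0.25, 0.5, 0.75]
share = r_squared(actual, predicted)
```

Multiplying the result by 100 gives the percentage form used in the text.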

We would like to say how good the correlation coefficient
is as an estimator of the true correlation coefficient. Any
value of R may be transformed to a new variable, Z, in the
following way (see Edwards (1967) 248-250 or Hays (1963)
530-531):

Z = (ln (1 + R) - ln (1 - R))/2. (3.04)

Fisher (1921) has shown that the distribution of Z is very
close to normal with a mean of zero and a standard deviation
of 1 and that Z is independent of the sample size. The
standard error of Z is:

S = 1/sqrt(N - n - 1) (3.05)

where N is the number of events and n the number
of variables. The correlation coefficient for the entire
population therefore lies between an upper bound of (Z + S *
K) and a lower bound of (Z - S * K) where K is the percentage
cut-off point on the normal curve (for a 99% confidence
interval, K = 2.58). These upper and lower bounds on Z may
be transformed back into R values so that a confidence
interval may be set around the correlation coefficient.
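
The confidence-interval procedure of Equations 3.04 and 3.05 can be sketched as follows. The function names and the sample value R = 0.5 are assumptions made for illustration and are not results from the thesis; the back-transformation uses the fact that the inverse of Equation 3.04 is the hyperbolic tangent.

```python
import math


def fisher_z(r):
    """Equation 3.04: transform a correlation coefficient R to Z."""
    return (math.log(1 + r) - math.log(1 - r)) / 2


def inverse_fisher_z(z):
    """Transform a Z value back to a correlation coefficient."""
    return math.tanh(z)


def confidence_interval(r, n_events, n_variables, k=2.58):
    """Bounds on R; k = 2.58 corresponds to a 99% confidence interval."""
    z = fisher_z(r)
    s = 1 / math.sqrt(n_events - n_variables - 1)  # Equation 3.05
    return inverse_fisher_z(z - k * s), inverse_fisher_z(z + k * s)


# Hypothetical R with the 6379 observations and 61 variables cited in the text.
low, high = confidence_interval(0.5, 6379, 61)
```

Because the bounds are set on Z and then transformed back, the resulting interval is not symmetric about R.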

We will be comparing the correlation coefficients 
obtained from different experimental sub-samples and would 
like to test the significance of the difference between two 
correlation coefficients. The Z transformation also permits 
this kind of test (Edwards (1967) 250-252).

3.4 Relative Importance of Clues

The regression program discussed in Section 3.6 adds
variables to the regression equation one at a time, giving
information after each addition about the improvement to R^2
caused by each variable. It will therefore tell us how much
each variable contributes to the final value of R^2.

It is the improvement in R^2 effected by a clue as it
enters the regression equation which indicates its importance
in accounting for the indexing (see Section 3.3). R^2
measures the sum of the direct and indirect effects of each
variable. A full discussion of the relative importance of
clue types in the best regression equation will be found in
Section 6.3.




3.5 Prediction with the Regression Model

After the line described by Equation 3.02 has been
determined for the sample, the values obtained for the B
coefficients can be used to predict values of Y for new
documents. Since the B coefficients describe a line which is
the closest fit for the experimental points, this line is the
best available predictor for new documents.

Let us assume we wish to index a new document with the
prediction function of our regression equation. For each
thesaurus descriptor to be considered by the model there will
be a set of X values, n per descriptor. The B coefficients
have already been calculated from the sample documents. To
estimate the percentage of indexers who will assign the first
descriptor, the appropriate B and X values are multiplied
together and the terms summed to get the value of Y.

The arithmetic is simple enough; logic subjects the
process to some restrictions, however. First, it would
obviously not be profitable to use the equation if the
correlation coefficient itself is not high. If only a small
part of indexer behavior can be accounted for by the textual
clues, then it doesn't make much sense to try to use the
clues as a substitute for human indexing.

Secondly, even if the average R is high, there may be a
group of documents or terms for which the R is quite low.
Thus it is important to know just how well the equation
predicts assignment for each of the sample documents. If the
predicted Y values vary wildly from the actual Y values for
the documents in the sample, the use of the regression
equation for prediction is not reasonable.

Third, we must not forget that there may well be some
uncontrollable variables or some peculiar characteristics of
the sample document set or indexers which influence the way
clues are used. It would not be fair to generalize, for
example, from a single sample of documents about instruments
and instrumentation to all documents in any technical field.

Fourth, it is quite possible that the predicted value of
Y may not fit our practical notions of what makes sense. The
values of Y for the sample lie between zero and one (1 >= Y >=
0) because they represent the proportion of indexers
assigning the term. Since proportions may not be negative or
greater than one, negative values of Y and values of Y
greater than one cannot occur. It is possible that when the
regression equation is used on new documents, some particular
combination of X's will make the predicted value of Y for the
new document lie outside the zero to one common-sense limits.
Statistically, there is nothing wrong with a predicted Y < 0
or Y > 1. If this occurs, we simply correct a Y < 0 to a
zero and a Y > 1 to a one.
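
The prediction step, including the correction of out-of-range values, might be sketched as below. The function name and the coefficient values in the usage lines are hypothetical; only the weighted sum and the clamping to the zero-to-one range come from the text.

```python
def predict_proportion(constant, weights, clue_counts):
    """Predict Y = A + sum of B_i * X_i, corrected to lie in [0, 1]."""
    y = constant + sum(b * x for b, x in zip(weights, clue_counts))
    # A predicted Y < 0 is corrected to zero, a Y > 1 to one.
    return min(1.0, max(0.0, y))


# Hypothetical coefficients for a model with two clue types.
typical = predict_proportion(0.1, [0.2, 0.05], [1, 2])
too_high = predict_proportion(0.1, [0.2, 0.05], [10, 0])
too_low = predict_proportion(-0.3, [0.2, 0.05], [0, 0])
```

The same coefficients are reused for every thesaurus descriptor; only the clue counts change from one descriptor to the next.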



3.6 Computer Program for Multiple Linear Regression

The regression calculations were done with a stepwise
multiple regression program, BMD02R, available from the
University of California at Los Angeles, Health Sciences
Computing Facility. This program calculates a series of
multiple linear regression equations. The program searches
for the independent variable (the clue type) with the highest
correlation with the dependent variable (percentage of
indexers assigning). The regression equation is then
calculated. The independent variable with the next-highest
correlation with those already in the equation is then chosen
and the regression equation recalculated. Each new equation
adds one new variable to the calculations. Draper and Smith
((1966) 163-195) may be consulted for a discussion of various
computational procedures for regressions including stepwise
regression.



After each variable is added to the equation, the program
prints the multiple correlation coefficient R, the
coefficient of multiple determination R^2, the standard error of
estimate, an analysis of variance table, the regression
coefficient, the value of F and the standard error for each
variable in the equation, and other useful statistical
information. Scatter plots of the residuals of each
independent variable against the dependent variable are also
available. The method for obtaining this stepwise
information, not ordinarily available, was suggested by
Efroymson (1960). The program will accept a maximum of 80
variables and 9999 experimental events or cases. Complete
documentation of the program may be found in Dixon ((1970)
233-257).

The regression program is written in Fortran IV (H level)
and uses Assembly language subroutines. A regression run for
about 6400 experimental events and 61 variables requires
about an hour of cpu time and from 3 to 12 hours elapsed time
on an IBM 360/65. Confidence intervals were calculated with
an interactive mathematical system called APL/360.

Chapter Four

THE COMBINATORIAL MODEL

4. The Combinatorial Model

4.1 Reasons for the Combinatorial Model

The combinatorial model is intended to cover exhaustively
a class of non-linear assignment rules of the type discussed
in Section 2.5.3. This model does not assume a linear
relationship nor does it make an assumption of normality,
except in a Central Limit sense.



4.2 Types of Indexer-Model Agreement

Let us assume we have a black box model of indexer
behavior. If this model is fed the textual clues from a
document and a term from the thesaurus, it replies with a
yes/no answer. Let us also assume we have a human indexer.
If this indexer is given an index term from the thesaurus and
is asked whether that term should be assigned to a document,
he, too, can give a yes/no answer.

This leads to four types or cases of indexer-model
agreement for the assignment of a particular index term to a
particular document:

case 1) neither indexer nor model assigns term,
case 2) both indexer and model assign term,
case 3) indexer assigns term, model does not,
case 4) model assigns term, indexer does not.

If the model always agrees with the indexer (case 1 and case
2 only), then it will be a perfect predictor of the human.
The greater the number of decisions of the case 3 and case 4
type, the worse the model is as a predictor of the human.
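
The four cases can be expressed as a small classification function. The sketch below is illustrative only; the function names and the five sample decisions are assumptions, not data from the experiment.

```python
from collections import Counter


def agreement_case(indexer_assigns, model_assigns):
    """Return the case number (1 to 4) for one document-term decision."""
    if not indexer_assigns and not model_assigns:
        return 1  # neither assigns the term
    if indexer_assigns and model_assigns:
        return 2  # both assign the term
    if indexer_assigns:
        return 3  # indexer assigns, model does not
    return 4      # model assigns, indexer does not


def tally_cases(decisions):
    """Count the cases over (indexer_assigns, model_assigns) pairs."""
    return Counter(agreement_case(i, m) for i, m in decisions)


# Invented decisions for one document and a five-term thesaurus.
decisions = [(False, False), (False, False), (True, True),
             (True, False), (False, True)]
totals = tally_cases(decisions)
```

A model that matched the indexer perfectly would produce a tally containing only case 1 and case 2 entries.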

Assume, for the moment, that we wish to test the ability
of a single textual clue to predict a human indexer's
performance. Further, assume that each sample document has
been tested for the presence or absence of this clue for each
of five possible thesaurus terms and that the human has also
registered his yes/no decision. The results of this test can
be summarized in the following way:











[Table: yes/no decisions of the model and of the indexer, and
the resulting case, for each document and term.]

We can summarize this example as:

              case 1   case 2   case 3   case 4

document 1       2        1        1        1
document 2       4        1        0        0






The case 1 through 4 totals for each document indicate how
accurately the model predicts the performance of the indexer
based on a single textual clue. If the model agrees with the
indexer all of the time, only case 1 and case 2 exist (as
document 2 illustrates). If the model is less successful,
then case 3 and case 4 conditions may also exist (as document
1 illustrates).

This example has a thesaurus of 5 terms. An increase in
thesaurus size almost necessarily increases the number of
case 1 occurrences (neither the model nor the indexer
assigns) since the index set for a document is not a function
of the thesaurus size and seldom contains more than five or
ten terms. In a larger thesaurus, the overwhelming majority
of case 1's completely swamp out the other cases. Such a
preponderance of agreement leads to an arithmetically
impressive model, but since the case 1 agreements carry
almost no information and disguise the occurrences of the
rest of the cases, they must be dropped from the model.



For document 1, then, the model correctly predicted 1 out
of 3 non-trivial assignments (that is, non-case 1
assignments) and thus accounted for 33% of the indexer's
performance with a single textual clue. For document 2, the
model predicted 1 out of 1 non-trivial assignments,
accounting for 100% of the indexer's performance. The
non-trivial assignments are a measure of how well the model
matches the indexer. The figure-of-merit for non-trivial
assignments is calculated as:

Figure-of-merit =
    (case 2) / (case 2 + case 3 + case 4). (4.01)



This figure-of-merit is often called a "precision ratio" and
is used in document systems to quantify the success in
answering requests. Becker and Hayes point out that this
measure "attaches no weight at all to agreement in 0's and is
therefore only suitable where the proportion of 1's to 0's is
low. It is the most obvious definition in those cases where,
at any rate in principle, the columns are indefinitely long
but the number of 1's in each is fixed or (statistically)
limited" (... 37T). The 0's of Becker and Hayes are our case
1's; their 1's are our cases 2 through 4. Since our thesaurus
is very large in comparison to the number of terms in a single
document's index set, our situation is an appropriate one in
which to use Equation 4.01.



Although this measure is an appropriate one for us, there
is nothing in the combinatorial model preventing the use of a
different figure-of-merit. In fact, a second figure-of-merit
is employed in Section 6.2 for the comparison of the Boolean
combinatorial model and the regression model. This second
figure-of-merit includes case 1's:

Fraction of all predictions modelled =
    (case 1 + case 2) / (all cases). (4.02)
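
Both measures reduce to simple ratios of case counts, as the following sketch shows. The function names are assumptions. The counts in the usage lines follow the two-document example as read from the text: document 1 has one correct prediction out of three non-trivial assignments, document 2 has one out of one, and the case 1 counts are inferred from the five-term thesaurus.

```python
def figure_of_merit(case2, case3, case4):
    """Equation 4.01: share of non-trivial decisions the model got right."""
    non_trivial = case2 + case3 + case4
    if non_trivial == 0:
        return None  # nothing non-trivial to score
    return case2 / non_trivial


def fraction_modelled(case1, case2, case3, case4):
    """Second figure-of-merit: agreements, including case 1, over all cases."""
    return (case1 + case2) / (case1 + case2 + case3 + case4)


document_1 = figure_of_merit(1, 1, 1)
document_2 = figure_of_merit(1, 0, 0)
document_1_all = fraction_modelled(2, 1, 1, 1)
document_2_all = fraction_modelled(4, 1, 0, 0)
```

The second measure rises quickly as the thesaurus grows, because every added term that nobody assigns counts as an agreement.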

The following papers may be consulted for more extensive
discussions of measures of nearness or coefficients of
association: Kuhns (1965), Jones and Curtice (1967).



We have calculated the success of the single-clue model
in predicting a single indexer's behavior for two documents.
The calculation can be repeated for any number of documents.
We could then report on the average success of the
single-clue model in predicting that particular indexer's
behavior by averaging the scores obtained for each of the
individual documents.


We could also obtain this average figure-of-merit for
each of a group of indexers. We could then average the
averages to obtain an over-all figure-of-merit to summarize
the success of the single-clue model in predicting group
indexing behavior.
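
The two levels of averaging can be sketched as below. The function names, the indexer labels, and the scores for the second indexer are invented for illustration.

```python
from statistics import mean


def indexer_average(document_scores):
    """Average figure-of-merit for one indexer over all documents."""
    return mean(document_scores)


def overall_figure_of_merit(scores_by_indexer):
    """Average of the per-indexer averages for a group of indexers."""
    return mean([indexer_average(s) for s in scores_by_indexer.values()])


# Hypothetical per-document scores for two indexers.
scores = {
    "indexer A": [1 / 3, 1.0],
    "indexer B": [0.5, 0.5],
}
overall = overall_figure_of_merit(scores)
```

Averaging the averages gives each indexer equal weight, regardless of how many non-trivial assignments that indexer produced.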



Similar calculations could be made for any other clue to
be used in the model. We could then compare the over-all
figure-of-merit for each of these clue types to say which
ones did a better job of predicting human indexing behavior.

4.3 Boolean Combinations of Clues

The Boolean combinations covered in this section are
intended to test a group of textual clues and a class of
selection rules in an exhaustive fashion. Obviously, other
combinatorial rules can be imagined and tried out, and other
types of textual clues could also be investigated. If this
first, exhaustive trial is successful, additional refinements
might be worthwhile.

We will be using two types of Boolean operators to
represent two types of indexing behavior. If the indexer
behaves as if both of two clue types are required to motivate
assignment, then AND behavior is displayed. For example,
suppose we consider the thesaurus term 'radiation counters'.
Suppose an indexer assigns the term only if the word
'radiation' and the word 'counters' are both present (but not
necessarily contiguous) in the document. The indexer is
saying 'radiation' AND 'counters' lead to the assignment of
'radiation counters'. This is AND behavior. Of course, AND
behavior may combine more than two clue types in a single
expression.

If the indexer behaves as if either of two clues could
motivate him, then he is exhibiting OR behavior. For
example, suppose the thesaurus term is 'vertical takeoff
aircraft' and the indexer assigns the term whenever either
the term itself or a used-for reference, 'convertiplanes', or
both occurs in the document. Either 'vertical takeoff
aircraft' OR 'convertiplanes' leads to the assignment of
'vertical takeoff aircraft'. This is OR behavior. OR
behavior may also combine more than two clue types in a
single ORed expression.
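
The two behaviors can be illustrated with a simple presence test. In this sketch the function names and the sample sentences are assumptions, and a plain substring test stands in for whatever matching the clue types actually perform.

```python
def clue_present(text, phrase):
    """Crude stand-in for a textual clue: is the phrase in the text?"""
    return phrase in text


def and_behavior(text, phrases):
    """Assign only if every listed clue is present."""
    return all(clue_present(text, p) for p in phrases)


def or_behavior(text, phrases):
    """Assign if at least one listed clue is present."""
    return any(clue_present(text, p) for p in phrases)


# Invented document texts.
text_1 = "radiation was measured with scintillation counters"
text_2 = "early convertiplanes were tested in the wind tunnel"

assign_radiation_counters = and_behavior(text_1, ["radiation", "counters"])
assign_vtol = or_behavior(
    text_2, ["vertical takeoff aircraft", "convertiplanes"])
```

Note that the AND rule does not require the two words to be adjacent, which matches the "not necessarily contiguous" condition in the example above.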

The exhaustive Boolean combination proceeds in the
following manner. First, each single-clue type is tested.
Each of the documents in the sample is tested for the
presence of each clue type for each term in the thesaurus.
The presence or absence of a clue type for a thesaurus term
is recorded in a yes/no indicator. Then, as discussed in the
previous section, each indexer's behavior is compared against
the single-clue model and the results summarized by the
average figure-of-merit discussed there. The individual
indexer figure-of-merit for a particular clue type is
averaged to yield an over-all figure-of-merit for each single
clue. This information is saved for later use in the model.

Next, the yes/no indicators for every pair of clues are
ANDed together. This produces a new yes/no indicator for the
presence or absence of that ANDed pair of clue types for each
thesaurus term. Each indexer's behavior is compared against
the two-clue-ANDed-model and the results summarized by an
average figure-of-merit. The individual-indexer performance
figures for a particular clue type are averaged to yield an
over-all figure-of-merit for each pair of ANDed clues. This
information is also retained for later use in the
combinatorial model.

Next, the procedures described above for use on all
single clues and all pairs of clues are repeated for all
triplets and quadruplets of clues and the information saved
for later use. Four was chosen as a maximum number for this
ANDing step because it appeared to be well beyond the
complexity humans might use in clue selection.

One would expect that much ANDing of single clues would
eventually produce a yes/no indicator consisting of nothing
but no's or zeros. These clue combinations cannot help in
the modelling since there are no terms which both the indexer
and the model agree to assign (that is, there are no case
2's). These unfruitful clue combinations are dropped from
further consideration.

At this stage in the procedure, we have produced and
saved all possible ANDed combinations of single, double, etc.
clues which might have some value later on in the model. In
order to have some value, the combinations must have shown
evidence of at least one thesaurus term for one document for
which the ANDed clue combination correctly predicted that the
indexer would assign the term.
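
The ANDing step can be sketched with integers used as bit strings, one bit per observation. This is a simplified illustration under stated assumptions: the function name and clue labels are hypothetical, and a single indexer vector is used where the procedure above averages over all indexers.

```python
from itertools import combinations


def and_combinations(clue_bits, indexer_bits, max_size=4):
    """Return every ANDed clue combination that yields at least one case 2.

    clue_bits: dict mapping a clue name to an integer bit string.
    indexer_bits: integer bit string of the indexer's assignments.
    """
    kept = {}
    for size in range(1, max_size + 1):
        for names in combinations(sorted(clue_bits), size):
            bits = -1  # all ones; each AND narrows it down
            for name in names:
                bits &= clue_bits[name]
            if bits & indexer_bits:  # some observation is a case 2
                kept[names] = bits
    return kept


# Invented indicators over four observations.
clues = {"C1": 0b1100, "C2": 0b0110, "C3": 0b0001}
indexer = 0b0101
useful = and_combinations(clues, indexer)
```

Combinations whose ANDed indicator shares no one bit with the indexer vector are the unfruitful ones and never enter the result.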



The next step is to test all possible ORed combinations
of the clues from the AND step. Each of the ANDed clues will
be ORed with all the other ANDed clues. After each trial
ORing takes place, the over-all figure-of-merit is calculated
for the new ORed combination under test. After pairs of
ANDed clues have been ORed together, triplets of ANDed clues
are ORed, then quadruplets, etc.

One would expect that much ORing of the ANDed clues would
also eventually produce lower over-all figures-of-merit since
the incidence of indexer-model agreement (that is, case 2, as
discussed in the previous section) can only increase to a
maximum of five or ten for each document, while the incidence
of indexer-model disagreement (that is, case 3 and case 4 as
discussed in the previous section) could increase
considerably beyond this. If ORing produces new trial clue
combinations with a decreased over-all figure-of-merit,
further ORing of these clues is terminated.



The end result of this sequence of ANDing and ORing is a
group of equations of the following form:

(C1) OR (C2 AND C3) OR (C4 AND C5 AND C6) OR ... (4.03)

where C1 through C6 are arbitrary clue types. Each ANDed
element in the equation may be composed of a single clue, or
pairs, triplets or quadruplets of clues ANDed together. Any
number of ANDed elements may be combined with OR operators.
Hence the equations, and each term within them, may be
variable.

Each of these Boolean equations is associated with an
over-all figure-of-merit which summarizes how well that
particular equation predicts the average performance of the
group of indexers. Because of the sequence of ANDing and
ORing operations, these remaining Boolean equations are
guaranteed to have the highest figure-of-merit. This is
therefore the set of equations which most accurately predicts
how the indexers behaved on the average. It is the best set
of models of human indexing behavior which we can build with
the specified procedure.



Ideally, we wish to obtain the simplest model which will
predict accurately how humans index. We are therefore
looking for the equation with the highest figure-of-merit and
with the least number of ANDed and ORed terms.

4.4 Statistical Tests of the Combinatorial Model



Statistical tests of the significance of the
combinatorial model are much less complex than those for the
regression model. The over-all figure-of-merit for the
highest ranking Boolean equation quantifies the amount of
human indexing accounted for by the textual clues. The
figure-of-merit for each of the equations is simply an
average of all indexer behavior, over all documents, for all
thesaurus terms in the sample. To be able to make statements
about the entire population of indexers, documents and
thesaurus terms from this sample, we use the Central Limit
Theorem (Hays (1963) 238-244) to obtain a normally
distributed population. For the Boolean equation with the
highest average figure-of-merit, we know how well the
equation predicts the average indexing for each document-term
pair. If random scores chosen from this large sample are
averaged, a normal distribution is produced. From this
normal distribution the standard deviation of the sample may
be calculated. The confidence interval for whatever
confidence coefficient we choose can then be obtained.



One of the major points of interest is a comparison of
the scientist-indexers against the librarian-indexers to
determine which group is most accurately represented by the
textual clue model. If the figures-of-merit are calculated
for the indexing of the scientist-indexer group alone, the
Central Limit Theorem provides standard deviations just as it
did above for the total indexer group. The calculations can
be repeated for the librarian-indexer group.



The relative importance of the textual clues is
immediately available from an observation of the equations
themselves. It is of interest to know which clues are most
frequently used in the ANDed and ORed equations.



The predictive properties of the Boolean equations are
straightforward. A new document is tested for the existence
of each of the clue types. These binary values are plugged
into the Boolean equation. The decision on the assignment of
each thesaurus term is "yes" if the Boolean equation returns
a value of one, and "no" if a zero is obtained.
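
Evaluating an equation of the form of Equation 4.03 for a new document amounts to an OR over ANDed groups. In the sketch below the function name, the data layout, and the clue values are assumptions chosen for illustration.

```python
def evaluate(equation, clues_present):
    """Evaluate an OR of ANDed clue groups for one document and term.

    equation: list of tuples, each tuple a group of clue names ANDed together.
    clues_present: dict mapping a clue name to True or False.
    """
    return any(all(clues_present[clue] for clue in group)
               for group in equation)


# The shape of Equation 4.03, with hypothetical clue values.
equation = [("C1",), ("C2", "C3"), ("C4", "C5", "C6")]
new_document = {"C1": False, "C2": True, "C3": True,
                "C4": False, "C5": False, "C6": False}
assign_term = evaluate(equation, new_document)
```

A single satisfied ANDed group is enough for a "yes"; the term is rejected only when every group fails.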

As with the regression model, we must use caution when
applying the model predictively. The Boolean equation is not
a universal automatic indexer just because it may account for
the human indexing behavior on a sample of documents. There
might well be special circumstances affecting our group of
documents and indexers which render the model inaccurate when
used on a radically different sample.

4.5 Computer Programs for the Combinatorial Model

The computer programs discussed in this section were
written in PL/I and run on an IBM 360/65. Assembly language
subroutines were used to generate random numbers and to count
the number of ones in a bit string.

After the comparison of the document words with the
thesaurus (Section 5.2.5), there were a total of 12,440
clue vectors. Of these, 6061 recorded no matches with the
thesaurus and no indexer assignments for that particular
index term. In other words, the entire clue and indexer
vector was zero. ANDing and ORing of these all-zero vectors
would not have affected the Boolean model, so they were
eliminated from further processing as far as this model was
concerned. From the remaining 6379 non-zero vectors, 2048
vectors were chosen randomly with the random number generator
proposed by Lewis, Goodman and Miller (1969). This
particular sample size was chosen because the IBM 360
machines can perform Boolean operations on a bit string of
length 2048 in a single machine instruction.

The master vector for each of these 2048 observations was
then read into core and organized in an array. This array
was 60 bits wide (one bit for each clue type) and 2048 bits
high (one bit for each observation in the sample). The array
was then transposed so that it could be efficiently handled
in later Boolean operations. The same procedure was followed
with the indexer array. Recorded in the master vector set
was information about whether each indexer assigned a
particular term, or did not assign it. Each indexer's choice
of terms then could be represented as an array one bit
wide (one bit to indicate whether the term was assigned or
not) and 2048 bits high (one bit for each observation in the
sample). Since there were twelve indexers, the array was
actually 12 bits wide. This array was also transposed so
that it could be compared efficiently with the clue array.
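
The transposition can be illustrated with a small sketch. It is not the original PL/I code: Python integers stand in for the 2048-bit strings, and the four-observation sample is invented for the example.

```python
def transpose(rows, width):
    """Turn per-observation rows into per-clue bit strings.

    rows:  list of observations, each a list of `width` bits (0 or 1).
    Returns `width` integers; bit i of integer j equals rows[i][j], so each
    integer holds one clue type across every observation.
    """
    columns = [0] * width
    for i, row in enumerate(rows):
        for j, bit in enumerate(row):
            if bit:
                columns[j] |= 1 << i
    return columns


# 4 observations by 3 clue types (illustrative data only)
rows = [[1, 0, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 0]]
columns = transpose(rows, 3)
```

After transposition a single AND or OR of two integers compares one clue with one indexer over the whole sample at once, which is the reason given in the text for storing the arrays this way.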





The ANDing program then processed the clue and indexer
arrays in the following manner. The clue array, now 2048 bits
wide by 60 high, was read into core. The indexer array, now
2048 bits wide by 12 high, was also read into core. The
program then tested the first clue against all twelve
indexers. It did this by ANDing the clue vector with the
first indexer's vector and counting the number of one bits in
the 2048 bit string. Counting was done with an Assembly
language subroutine suggested by Baduchel (1970). The number
of one bits in the ANDed string equaled the number of
observations in which the clue vector agreed with the indexer,
that is, the number of case 2's in the sample. This is the
numerator of the figure-of-merit. The same first clue was
then ORed with the same indexer vector, and the one bits in
the ORed string counted. The number of one bits in this ORed
string equaled the number of observations in which either the
clue type or the indexer indicated a term should be assigned,
that is, the number of case 2's plus case 3's plus case 4's
in the sample. This is the denominator of the
figure-of-merit. This sequence of ANDing and ORing a clue
vector with the indexer vector was repeated for each of the
twelve indexers. The resulting average figure-of-merit was
then stored with the clue pattern and vector, for later use in
the ORing program. Of course, if the figure-of-merit was zero,
the vector and the information about it were discarded.
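
The figure-of-merit calculation just described can be sketched as below. Python integers are used as bit strings, and the bit count uses Python's built-in `bin` in place of the Assembly language subroutine; the example vectors are invented.

```python
def popcount(bits):
    """Number of one bits in an integer used as a bit string."""
    return bin(bits).count("1")


def figure_of_merit(clue, indexer):
    """Ones in (clue AND indexer) divided by ones in (clue OR indexer)."""
    union = popcount(clue | indexer)
    return popcount(clue & indexer) / union if union else 0.0


def average_merit(clue, indexers):
    """Mean figure-of-merit of one clue vector over all indexer vectors."""
    return sum(figure_of_merit(clue, ix) for ix in indexers) / len(indexers)


clue = 0b1101                   # clue present in observations 0, 2 and 3
indexers = [0b0101, 0b1000]     # two illustrative indexer vectors
merit = average_merit(clue, indexers)
```

With these vectors the two individual figures-of-merit are 2/3 and 1/3, so the average is one half.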

After processing the first clue vector in this manner,
the program then ANDed the first clue vector with the second
and tested the resulting vector against the indexer vectors.
It then tried ANDing in the third vector, and so forth. When
the program had tried all possible ANDed combinations
involving the first clue vector, it then moved on to the
second. This ANDing sequence was chosen to minimize access
time in core. The result of this processing was a total of
5572 ANDed vectors. The best of these vectors had a
figure-of-merit of 0.11517.
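
One way to organize such an ANDing sequence is a depth-first search, sketched below. The text does not say how the search was bounded; this sketch assumes that a combination is abandoned as soon as its figure-of-merit is zero, which is safe because ANDing in further clues can only clear bits. The clue and indexer vectors are invented.

```python
def popcount(bits):
    return bin(bits).count("1")


def average_merit(vector, indexers):
    """Mean of |vector AND indexer| / |vector OR indexer| over the indexers."""
    total = 0.0
    for ix in indexers:
        union = popcount(vector | ix)
        total += popcount(vector & ix) / union if union else 0.0
    return total / len(indexers)


def and_combinations(clues, indexers):
    """Map each surviving tuple of clue indices to its average figure-of-merit."""
    results = {}

    def extend(start, combo, vector):
        for j in range(start, len(clues)):
            new_vector = clues[j] if vector is None else vector & clues[j]
            merit = average_merit(new_vector, indexers)
            if merit == 0:
                continue  # assumed pruning: deeper ANDs cannot recover
            results[combo + (j,)] = merit
            extend(j + 1, combo + (j,), new_vector)

    extend(0, (), None)
    return results


clues = [0b1100, 0b0110, 0b0011]   # three illustrative clue vectors
indexers = [0b0100]                # one illustrative indexer vector
results = and_combinations(clues, indexers)
best = max(results, key=results.get)
```

Here clue 0 ANDed with clue 1 matches the indexer exactly, so that pair is the best combination.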

The ORing programs were organized in a similar manner,
except that there was not enough core storage or computer
time to handle all 5572 ANDed vectors. For this reason, the
best 300 ANDed vectors were processed one at a time against
the other ANDed vectors. ORing of pairs of ANDed vectors
produced a total of 45,150 ORed vectors with a high
figure-of-merit of 0.15051. ORing continued, one stage at a
time, until a maximum of eight ANDed clues had been ORed
together. The vector with the highest figure-of-merit was
separated by sorting and is discussed in Chapter 6.

The confidence interval around the best figure-of-merit
was obtained by taking random selections of 32 observations
from the 2048 observations in the final best vector. The
individual figures-of-merit for each of these smaller groups
were calculated and the results used to compute the
confidence interval.





Chapter Five



EXPERIMENTAL PROCEDURES
AND SAMPLES

5. Experimental Procedures and Samples



5.1 The Documents and Indexers 



A group of scientists and engineers (see Section 2.6)
with experience in the field of instrumentation was available
to serve as scientist-indexer subjects. To cater to their
field of specialization, all documents indexed by any of the
following terms were selected from the 1969 subject index of
U.S. Government Research and Development Reports (USGRDR):
acoustic measuring instruments, aircraft instruments,
astronomical instruments, charge measuring instruments,
electrically powered instruments, electric measuring
instruments, meteorological instruments, optical measuring
instruments, pneumatic instruments, radiation measuring
instruments, recording instruments, spacecraft instruments,
strain measuring instruments, surveying instruments,
temperature measuring instruments, thermal measuring
instruments, time measuring instruments, voltage measuring
instruments. These terms are the set of descriptors with
"instruments" as the last word with two exceptions, surgical
instruments and musical instruments, which were not included
because they fell outside the usual range of instrument
subject expertise for the individuals involved.



The 1969 USGRDR indexes contained 78 documents indexed
under the above terms. These documents were arranged in
ascending order by the report number. A random number table
was used to select twenty documents to serve as a test
sample. The complete information for each of these twenty
documents was then keypunched directly from the USGRDR entry
(see Section 5.2.1 for details). Only the title and abstract
were used in the experiments discussed here. Hereafter the
word "document" means only the title and abstract of the
document as those titles and abstracts appear in USGRDR.

Two groups indexed each of the twenty documents. The
first group consisted of the six librarian-indexers and the
second, the six scientist-indexers. Each indexer was given
the same set of materials from which to work. This set
consisted of 1) the titles and abstracts of each of the
documents to be indexed in a standard printed format, 2)
indexing instructions and 3) the Engineers Joint Council
(EJC) Thesaurus of Engineering and Scientific Terms (1967).
The standard document format was produced by a computer
program which arranged each document on the page so no words
were broken at the end of a line. Some standard information
was printed at the bottom of each page. The documents were
printed on alternate pages so the indexer could see only a
single document at a time. See Figure 5.01 for a reduced
copy of one page of this printout.



The instructions to the indexers are reproduced in Figure
5.02. A page from the EJC Thesaurus is reproduced in Figure
5.03. The terms chosen by the indexer for each document were
keypunched and a computer program then collected the
individual index sets for each of the documents and for each
thesaurus phrase assigned. This program provided the
"terms-assigned" information for the programs discussed in
Section 5.2.





5.2 Clue Counting Procedures



The problem of finding, identifying and counting
particular types of clues in natural language text is common
to both of the indexing models used in this thesis. When
even moderate numbers of clues must be located, the task
becomes much too tedious to be done accurately by hand. For
this reason, computer programs were written to find and count
each clue type. All of the computer programs discussed in
this section were written in PL/I and run on an IBM 360/65.

5.2.1 Keypunching

Each document in the sample was keypunched, proof-read
and corrected. In general, the text was keypunched exactly
as printed. Exceptions to this rule were caused by the
limited keypunch character set:

1 If the document contained a character not on the
keypunch, the word for that character was
substituted. This rule was very seldom needed.

2 When words were broken with a hyphen over the end
of a justified line of printed text, the hyphen was
dropped and the word "glued together" again in the
keypunching.

3 Subscripts and superscripts were keypunched on the
line with the text.

4 All lower case letters in the printed text were
keypunched as upper-case characters.

Figure 5.04 shows the original version of one of the
documents. The machine-printed version of this document is
shown in Figure 5.01.

A program was written to isolate each word in the running
text. This program considered a word to be any sequence of
the alphabetical characters (A...Z) unbroken by a
non-alphabetic character (0123...9 , : ; / etc.). Since none
of the thesaurus terms contained non-alphabetic characters,
this procedure did not discard any potential matches. Each of
the single words was written on a sequential file with
information on the document being processed, the location of
that word in the document (title or abstract) and the
relative position of the word in the document (counting the
first word in the document as one, the second word as two,
etc.).
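
A minimal sketch of this word isolation step is given below. The record layout (document number, location, position, word) follows the description above, but the function name is hypothetical, and continuing the position count from the title into the abstract is an assumption.

```python
import re


def isolate_words(text, location, doc_id, start=1):
    """Return (doc_id, location, position, word) for each run of letters A-Z.

    Any non-alphabetic character (digit, punctuation, blank, hyphen) ends a
    word. `start` is the relative position given to the first word found.
    """
    records = []
    for offset, match in enumerate(re.finditer(r"[A-Z]+", text.upper())):
        records.append((doc_id, location, start + offset, match.group()))
    return records


title = isolate_words("FLIGHT PROTOTYPE MODEL METEOR FLASH ANALYZER", "title", 1)
abstract = isolate_words("A THREE-CHANNEL RADIOMETER WAS DESIGNED.", "abstract", 1,
                         start=len(title) + 1)
```

Note that the hyphen splits THREE-CHANNEL into two words, as the rule in the text requires.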



5.2.2 Reduction to Singular Form



The matching procedure detailed in later sections of this
chapter considers singular and plural forms of a word to be
equivalent. Each of the words isolated in the previous
section was tested for the ending 'ies', 'es' or 's'. If a
word ended in 'ies', this ending was changed to a 'y'; if the
word ended in 's', the 's' was dropped; if the word ended in
'es' the ending was dropped after sibilants ('s', 'ss', 'c',
'sh', etc.). Exceptions to these general rules were
programmed individually. For instance the singular forms of
'pulses' and 'mars' do not follow the regular rules and were
therefore handled as exceptions.

Since the comparison had to be made between the document 
and the thesaurus, the same procedure was followed for the 
words from each of the thesaurus descriptors. Figure 5.05 
shows the singular form of some words from the document in 
Figure 5.04. 
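
The singular-form rules can be sketched as follows. The order in which the rules are tried, the list of sibilant endings and the singular forms given for the two exceptions are assumptions made for the illustration; the original exception table is not reproduced in the text.

```python
# Exceptions named in the text; the singular forms shown are assumed.
EXCEPTIONS = {"PULSES": "PULSE", "MARS": "MARS"}

# Assumed sibilant endings after which a whole 'ES' is dropped.
SIBILANT_ENDINGS = ("S", "SH", "CH", "X", "Z")


def singular(word):
    """Reduce an upper-case word to its singular form by the rules of 5.2.2."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("IES"):
        return word[:-3] + "Y"
    if word.endswith("ES") and word[:-2].endswith(SIBILANT_ENDINGS):
        return word[:-2]
    if word.endswith("S"):
        return word[:-1]
    return word


forms = [singular(w) for w in ("STUDIES", "FLASHES", "METEORS", "PULSES", "MARS")]
```

Without the exception table 'PULSES' would lose 'ES' after the sibilant 'S' and 'MARS' would lose its final 'S', which shows why such words had to be programmed individually.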

5.2.3 Stemming 

The root segment of each of the words was then found with 
the stemming algorithm suggested by Lovins (1968). This
algorithm searches for the longest match in a list of endings 
ordered by length. If a match occurs, and if context- 
sensitive conditions associated with that ending are 
satisfied, the program strips the ending from the word. The 
resulting stem is then additionally transformed with recoding 
rules which handle spelling exceptions. 
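
The longest-match principle can be illustrated with the sketch below. It is not the Lovins algorithm itself: the five endings, the minimum stem lengths used as context-sensitive conditions and the single recoding rule (undoubling a final consonant) are a small assumed subset chosen for the example.

```python
# Assumed ending list: ending -> minimum length of the stem left behind.
ENDINGS = {"IVITY": 2, "EMENT": 2, "ING": 3, "ION": 3, "ED": 3}
ORDERED_ENDINGS = sorted(ENDINGS, key=len, reverse=True)  # longest first
DOUBLED = set("BDGLMNPRST")  # consonants undoubled by the recoding rule


def stem(word):
    """Strip the longest ending whose condition holds, then recode the stem."""
    for ending in ORDERED_ENDINGS:
        if word.endswith(ending):
            root = word[:-len(ending)]
            if len(root) >= ENDINGS[ending]:
                word = root
                break
    # Recoding: reduce a doubled final consonant to a single letter.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in DOUBLED:
        word = word[:-1]
    return word


stems = [stem(w) for w in ("PROVIDING", "SENSITIVITY", "DETECTION", "MEASUREMENT")]
```

If the condition for the longest matching ending fails, the loop goes on to try the shorter endings, so a word is left unchanged only when no ending qualifies.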

To minimize search time, the list of endings was hashed 
with the division method (see Lum, Yuen and Dodd (1971) for a 
comparison and review of various hashing techniques). A 
number stored at the hash location pointed into a separate 
table which resolved clashes and itemized the context- 
sensitive conditions to be satisfied for each of the endings. 
If the conditions were satisfied, the recoding procedures 
were invoked. The resulting stem was then paired with the 
original word in a record comprised of document number,
location and relative position. Figure 5.05 also shows the
stemmed form of some words from the document in Figure 5.04.
The appendix summarizes the additions and changes to Lovins'
endings, conditions and recoding rules necessitated by the
vocabulary in our sample.

5.2.4 Thesaurus Terms Used in the Models 



The Engineers Joint Council Thesaurus contains 17,810
descriptors. Most of the thesaurus would have no matches
with any sample document and would not be assigned by any of
the indexers. Thus, most of the thesaurus could reasonably
be expected to have an indexer and clue vector consisting
entirely of zeros. These experimental points would be
useless for this investigation. For this reason, the size of
the thesaurus was reduced for processing in the following
way. First all index terms assigned by any of the indexers
to any of the documents were included in the thesaurus.
There were 430 of these terms. This group of terms includes,
for any particular document, all the clue vectors which have
non-zero indexer values.



To include other vectors with guaranteed non-zero clue
values in the vector, a sort and count was made of all the
words in all the documents. Omitting function words such as
'as', 'a', 'the', the most frequently used words were used to
search the complete EJC Thesaurus for descriptors containing
these words. Descriptors containing these frequently used
words were added to the first group of 430 descriptors. Note
that this procedure forces the models to account for, not
just the assignment of descriptors, but also the
non-assignment of likely descriptors. This choice makes the
model more conservative in ascribing machine-like behavior to
the humans. The final mini-thesaurus contained 622 terms.
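
The construction of the mini-thesaurus can be sketched as below. The function-word list, the cutoff `top_n` for "most frequently used" and the sample descriptors are all assumptions; the text does not state how many frequent words were used in the search.

```python
from collections import Counter

FUNCTION_WORDS = {"AS", "A", "THE", "OF", "AND"}  # assumed stop list


def mini_thesaurus(assigned_terms, thesaurus, documents, top_n):
    """Assigned terms plus every descriptor containing a frequent document word."""
    counts = Counter(word
                     for doc in documents
                     for word in doc.split()
                     if word not in FUNCTION_WORDS)
    frequent = {word for word, _ in counts.most_common(top_n)}
    added = {d for d in thesaurus if frequent & set(d.split())}
    return set(assigned_terms) | added


documents = ["THE METEOR FLASH ANALYZER", "DETECTION OF METEOR FLASHES"]
thesaurus = ["METEOR TRAILS", "RADIOMETERS", "SPACE PROBES"]  # hypothetical entries
result = mini_thesaurus({"RADIOMETERS"}, thesaurus, documents, top_n=1)
```

The second group contributes descriptors that the documents suggest but that no indexer chose, which is what lets the models account for non-assignment as well as assignment.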

5.2.5 Document-Clue Matching Procedure 

As discussed in Section 2.5, matching phrases, synonyms,
words and roots in the thesaurus and in each document were
counted to produce what we are calling a "clue vector". For
each document-descriptor pair, this clue vector summarizes
the number of times each clue type appears in the document.

Information on the number and types of clues existing in 
each document was obtained from a program which compared each
descriptor in the mini-thesaurus against the words of each
document in the sample. The program first hashed a 
document’s words into core storage. A single thesaurus 
phrase was then read in. It was compared with the words of 
the document by hashing the thesaurus words and searching for 
matches with the hashed document words. If matches did 
occur, the clue vector for that document-thesaurus pair was 
updated with the appropriate information and the program then 
read in the next thesaurus phrase. After the entire
mini-thesaurus had been compared with the first document, the
hashing locations were cleared so that the next document’s 
words could be processed. This program processed a total of 
12,440 vectors for the documents in about 30 minutes. 
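
The matching loop can be sketched as follows. Python's `Counter` (a hash table) stands in for the hashed core storage, and only two illustrative clue types are counted, whole-word matches in the title and in the abstract; the clue names and sample descriptors are hypothetical.

```python
from collections import Counter


def match_document(title_words, abstract_words, descriptors):
    """Build one clue vector per descriptor for a single document."""
    title_index = Counter(title_words)        # document words hashed once
    abstract_index = Counter(abstract_words)
    vectors = {}
    for descriptor in descriptors:            # one thesaurus phrase at a time
        words = descriptor.split()
        vectors[descriptor] = {
            "WORD_IN_TITLE": sum(title_index[w] for w in words),
            "WORD_IN_ABSTRACT": sum(abstract_index[w] for w in words),
        }
    return vectors


vectors = match_document(
    ["METEOR", "FLASH", "ANALYZER"],
    ["METEOR", "DATA", "CHANNEL", "METEOR"],
    ["METEOR TRAIL", "RADIOMETER"],
)
```

A vector is produced for every document-descriptor pair, including pairs with no matches, which is consistent with the total of 12,440 vectors (622 descriptors for each of 20 documents).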

5.3 Sub-Samples Tested

The result of the processing described in Section 5.2 is
a set of 12,440 clue vectors, 622 clue vectors for each of
the 20 documents in the sample. We will call this set the
'master set'. Of the 12,440 vectors, 6061 were completely
zero in both indexer assignments and clues; 6379 were
non-zero in at least one portion of the record.

Since the difference in indexing behavior between
scientist-indexers and librarian-indexers is of considerable
interest, two new sets of clue vectors were produced for
these two groups of indexers. Each of the new vector sets
was based on the indexing done by the appropriate indexer
group.

Several other subsets were taken. Since many of the 
studies in Chapter 2 considered only the terms assigned by 
the indexers, a subset of vectors was made by separating only 
those terms which were assigned by at least one of the 
indexers. These vectors should show greater evidence of 
"machine-like" indexing than the rest of the master set, if
the effects noted in Chapter 2 hold. A second subset was
made by separating only those terms assigned by two or more 
indexers. 



It is also of interest to know how each document and 
indexer varies from the average. Information on the 
documents is obtained by processing the clue vectors for each 
document separately. Information on each indexer is obtained
by re-running the entire model with clue vectors based only
on the indexer in question. These runs will characterize
individual documents and indexers in detail. They might, for
instance, reveal a group of documents which are modelled
extremely well, and a group which are not modelled
successfully. Further inspection of these documents may help
to explain the success or failure of the model. Five
documents were selected randomly for individual processing:
documents 1, 2, 6, 14 and 20. Four indexers were selected
randomly for individual processing: 4, 0, 7, 11.

5.4 Statistics Describing the Documents and Indexers

To give the reader a feeling for the document sample,
some numerical parameters summarizing the incidence of clues
have been tabulated in this section. Figure 5.06 gives
information on the length of the documents in the sample.
Figure 5.07 lists the number of terms which were assigned by
from one to twelve indexers. For example, on document 4,
eleven of the indexers agreed one of the terms should be
assigned, while there were 26 terms assigned by just one of
the indexers. Figure 5.08 summarizes the number of times
each clue type occurred in the entire document sample. Note
that the number of clues occurring in the title was always
less than the number occurring in the abstract. This is
because the title was short in comparison to the abstract.
Figure 5.09 gives the distribution density of all clue types
in each of the sample documents. For instance, of the 37,320
possible clues for each document (622 thesaurus terms times
60 clue types) 508 clues appeared once each in document 1.
However, 61 clues appeared four times each in document 1.

Figure 5.01 Document from the printout used by the indexers

Figure 5.02 Instructions for the indexers



INSTRUCTIONS



Imagine that you are a professional indexer for STAR (Scientific
and Technical Aerospace Reports) or USGRDR (U.S. Government Research
and Development Reports). Both of these indexing and abstracting
journals are distributed internationally to engineers and scientists
interested in current information in their fields.

On each page of the enclosed printout is a document. Below the
document are numbered blank lines on which you are to record
your choice of indexing terms for that document. The
terms must be chosen from the enclosed EJC Thesaurus of Engineering
and Scientific Terms. (If you are not familiar with this thesaurus,
see description following these instructions.) Space is
provided for up to ten indexing terms. If you wish to
assign more than ten terms to a document, simply write in
the additional terms at the bottom of the page.

You may find it helpful to note the important subjects while 
reading over the document. You may use the space below the
document for this purpose. The thesaurus can then be used to 
rephrase these subjects into the appropriate index terms 
on the numbered lines or below them. 

Choose the most appropriate (applicable or useful) terms from 
the thesaurus for each document. Any number of terms may be 
assigned. Be as specific as possible in assigning terms. Remember 
you are indexing for engineers and scientists who will want to find 
these documents for their own research. The terms you assign 
should enable them to locate pertinent information quickly. 

Please keep track of the time you spend indexing. Use the right- 
hand side of the printout page to record each time you begin 
indexing in hours and minutes, for instance: begin 4:32. When 
you are interrupted or have to quit, record the end time as:
end 5:25. This job should be least imposing if you choose a 
time and a place permitting extended periods of concentration 
without disturbance. 

In summary, you have two tasks: 

1) Assign the best index terms to each document, 

2) Keep track of all time spent indexing. 

If you have any questions at any time, please call me 
collect at home (415-327-0727) or at work (408-227-7100
ext. 5435 or ext. 5611) . Many thanks for your help. 




Caryl



Figure 5.02 Instructions for the indexers (continued)



DESCRIPTION OF THE EJC THESAURUS

Two sections of the thesaurus have been marked with tabs.
The first section lists all index phrases in alphabetical
order, character-by-character ignoring spaces and punctuation.
(Note that this is not the usual alphabetical order.
"...s" appears before "Band saws" because the blank in the
two-word phrase is ignored.) This section of the thesaurus
gives suggestions for broader terms (BT), narrower terms (NT)
and related terms (RT). These additional terms may be useful in
choosing the best indexing terms for the document.

The second section of the thesaurus lists, in alphabetical
order, every word used in every index phrase in the first
section. You will find this section helpful if you would like
to locate all index phrases containing a particular word.

Abbreviations used in both sections are explained in footnotes
at the bottom of each page.

Figure 5.03 Page from the Engineers Joint Council Thesaurus



USE 



Interplanetary dust 0301 
Smaller than micrometeoroids 
UF Meteoroid dust 
BT Interplanetary medium 
RT Micrometeoroids 
Space hazards 

Interplanetary flight 2201 
BT Space flight 
RT Astrodynamics 
— Orbits 

Spacecraft guidance 
Space exploration 
— Space navigation 
Interplanetary matter
USE Interplanetary medium
Interplanetary medium 0301
UF Interplanetary matter
NT Interplanetary dust
RT Interstellar matter
— Meteoroids 
Micrometeoroids 
Solar atmosphere 
Solar wind 
Spacecraft debris 

Interplanetary navigation 1707 
2201 

BT Navigation 

Space navigation 
RT Celestial navigation 
Radar navigation 
— Radio navigation 
Interplanetary plasma
USE Solar wind 

Interplanetary probes 2202 
Unmanned vehicles for interplanetary 
missions; for manned interplanetary 
vehicles see Interplanetary spacecraft
BT Spacecraft

Space probes
Unmanned spacecraft
NT Mars probes
Venus probes
RT Deep space probes

Interplanetary spacecraft
Lunar probes
— Planets

Interplanetary space 0301
RT Aerospace environment 
Interplanetary spacecraft 2202 
Manned vehicles for interplanetary 
missions; for unmanned 
interplanetary vehicles see 
Interplanetary probes 
BT Manned spacecraft 
Spacecraft 

RT — Artificial satellites 
Deep space probes 
— Interplanetary probes
Lunar spacecraft 
Mars probes 
Rendezvous spacecraft 
— Space probes 
Space stations 
Venus probes 

Interplanetary trajectories 2203 
BT Spacecraft trajectories 
Trajectories 

RT Circumlunar trajectories
Earth moon trajectories 
Parking orbits 
Planetary orbits 
Rendezvous trajectories 
Transfer orbits 
Interpolation 1201 
BT Numerical analysis 
NT Divided differences 



Interpreter routines 0902

BT Computer programs

Computer systems programs
RT Assembler routines
Compilers

— Operating systems (computers)
Simulator routines 
Translator routines 
Interpreters 0902 
NT Punched card interpreters 
RT — Punched card equipment 
Interrogation 0502 
RT — Data processing
— Intelligence

Interrogator transmitters 1702

BT Radio equipment 
Radio transmitters 
Transmitters 
RT— Radio receivers 

Radio transponders 
Interrupters 0901 
BT Control equipment 
Electric switches 
RT — Circuit breakers
Circuit protection 
— Electric relays 
Vacuum switches 
Intersections 1302 
No grade separation 
UF Grade crossings 
Railroad crossings
NT Interchanges 
RT Crossings 
— Highways 
Ramps 
— Roads 
Streets 

Interservice support

USE Joint operations 
and Logistics operations 
and Logistics support 
Interstate highway system 1302 
RT— Cargo transportation 
Highway transportation 
Interstate transportation
— Limited access highways 
Interstate transportation 1505 
BT Transportation 
RT — Air transportation 
— Cargo transportation 

Commercial transportation 
Common carriers 
Highway transportation 
Interstate highway system 
Passenger transportation 
Petroleum transportation 
— Pipelines 

Pipeline transportation 
Rail transportation
— Water transportation
Waterway transportation
Interstellar flight
USE Space flight 
Interstellar matter 0301 
RT — Celestial bodies 

Cosmic gas dynamics 
— Interplanetary medium 
— Nebulae 
Interstices 1407
RT Capillarity 
— Cavities 
Filterability 
Fluid infiltration 
Percolation 
Permeability 
— Porosity 
Voids 

Interstitials 2002 1106 

RT — Additives 

— Crystal defects 
Crystal structure 



Intestinal atresia 0605 
BT Congenital abnormalities
Gastrointestinal diseases
Intestinal diseases
USE Gastrointestinal diseases 
Intestinal obstructions 0605 
NT Intussusception 
RT Adhesions (intestines) 
Appendicitis 
— Benign neoplasms 
Constipation 

— Gastrointestinal diseases 
— Hernias 
Inflammation 
— Neoplasms 
Peritonitis 
Intestines 0616
BT Digestive system 

Gastrointestinal system 
NT Colon (Intestines) 

Duodenum 

Ileum 

Jejunum 

RT Appendix (intestines) 
Intracellular potential 0605 
RT — Electrophysiologic recording
Intracranial

electroencephalography 0510
0605

BT Electroencephalography

Electrophysiologic recording
RT Scalp electroencephalography
Intramuscular infusions

USE Parenteral infusions
Intrastate transportation 1305 
BT Transportation 
RT — Air transportation
— Cargo transportation 

Commercial transportation 
Highway transportation 
Passenger transportation 
Petroleum transportation 
— Pipelines 

Rail transportation 
— Water transportation
Waterway transportation
Intravenous infusions
USE Parenteral infusions 
Intrinsic viscosity 2004 
BT Rheological properties 
Transport properties 
Viscosity 

RT Dynamic viscosity 
Kinematic viscosity 
Relative viscosity 
Saybolt viscosity 
Intrusive rocks 0807 
UF Abyssal rocks 
Plutonic rocks 
BT Igneous rocks 
Rocks 

NT Diabase 
Diorite 
Dunite 
Gabbro 
Granite 
Magma 
Monzonite 
Pegmatite 
Peridotite 
Porphyry 
Quartz diorite 
Quartz monzonite 
Syenite 

RT — Basic rocks
Phanerite 



— Use preferred term; UF = Used For; BT = Broader Term; NT = Narrower Term; RT = Related Term.

Figure 5.04 Printed version of the document in Figure 5.01



N68-38439*# TRW Systems Group, Redondo Beach, Calif.

FLIGHT PROTOTYPE MODEL METEOR FLASH ANALYZER
Final Report

F. N. Mastrup and C. O. Bass Apr. 1968 195 p refs
(Contract NAS9-0532)

(NASA-CR-92364; TRW-0520Z.601 5-ROOO) CFSTI: HC $3.00/MF
$0.65 CSCL 14B

A flight prototype Meteor Flash Analyzer with a three-channel
radiometer was designed, constructed, and tested. Each channel
has video outputs to measure the intensity vs. time variation of
individual meteor flashes; and there are a total of 9 meteor data
channels for making related measurements. The long wavelength
(iron) channel nearly coincides with the conventional spectral range
for photographic meteors, providing correlation with ground-based
observational data. Detection sensitivity for terrestrial meteors in
the iron channel is background radiation limited; and this appears
to yield superior sensitivity for the optical detection of meteors in
wavelength bands below the ozone limit at ~0.30 μ. For satellite
altitude of naut m., detector field of view of 30°, and detector
aperture dia of 5 cm, limiting photographic meteor magnitude was
+3.3, with an inverse count rate of 5.G min/meteor. At 600 naut
m. count rate is expected to be 22 min/meteor with a magnitude
of -1.1. Significantly larger count rates are expected for the
magnesium and silicon channels. M.W.R.



Figure 5.05 Singular and stemmed forms of some words from
the document in Figure 5.04

singular form     stemmed form

flight            flight
prototype
model             model
meteor            meteor
flash             flash
analyzer
designed          design
constructed       construe
variation
measurement       measur
photographic
providing         provid
sensitivity       sensit
detection         detect
detector

Figure 5.06 Length of the sample documents

Document    Number of characters    Number of words
            in document             in document

01           957                    139
02           663                     89
03           931                    136
04          1164                    166
05          1187                    179
06           999                    140
07          1208                    169
08           606                     86
09                                  154
10           267                     33
11           852                    127
12          1078                    153

1022
15
17
184
19
20
64

Figure 5.07 Distribution of Indexing Consensus on Sample
Average number of indexers assigning a term: 2.31
Standard deviation: 2.23

Figure 5.08 Number of Clue Occurrences in Entire Sample

Type of
Match       MN      ST      US      EE      NE      EL    Total

3 A  2T  1 4  2 A
            61      73      17              54     105      320
MT         324     448     149     243     600    1492     3256
MA        2216    3246    1141    1845    4509   11758    24715
M2T        215     297      84     180     285     988     2049
M2A       1632    2245     634    1087    2593    7352    15543
M1T         18      31      10      25      38      68      190
M1A        220     283     139     220     498     808     2168
Totals            6631                    8579   22572    48258

(See Figure 6.01 for an explanation of acronyms.)

Figure 5.09 Distribution of All Clue Occurrences in Documents
Number of Occurrences/Document for All Clues

Chapter Six

6. Conclusions



6.1 Introduction



This chapter discusses results of the models
described in Chapters 3 and 4. To simplify the following
discussion and to save repeating long names, each of the
clues has been assigned a brief descriptive name. These
acronyms are listed in Figure 6.01 (pages 122 and 123)
together with a fuller description of the clue.

6.2 Evidence For and Against Machine-Like

6.2.1 Results of the Multiple Linear Regression

Information about the major regression runs is surmarized 
below. 

All Indexers 

6379 experimental events 

47 clue types with correlation greater than .0001 
with dependent variable 
multiple correlation coefficient (R) ; 0.5386 
square of correlation coefficient ; 0.2901 

99% confidence interval for R: .5153 to .5611 

Librarian Indexers 

6379 experimental events 

45 clue types with correlation greater than .0001 
multiple correlation coefficient (R) r 0.5364 
square of correlation coefficient (RZ) : 0.2877 
99% confidence interval for R; .5130 to .5590 

Engineer and Scientist Indexers 
6379 experimental events 

46 clue types with correlation greater than .0001 

multiple correlation coefficient (R) ; 0.4674 

square of the correlation coefficient (RZ) • 0.2184 
99% confidence interval for R; ',4418 to ,4923 
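
The intervals listed above appear to be consistent with the usual Fisher z-transformation of a correlation coefficient. This section does not say how the intervals were actually computed, so the following is only a sketch under that assumption. The function name `fisher_ci` is chosen for illustration, and the method treats R as a simple correlation, ignoring the number of predictors in the regression.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.99):
    """Approximate confidence interval for a correlation r from n events.

    Assumption: the Fisher z-transformation, which this section of the
    text does not name explicitly.
    """
    z = math.atanh(r)                             # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)                   # standard error on the z scale
    q = NormalDist().inv_cdf(0.5 + level / 2.0)   # two-sided normal quantile
    return math.tanh(z - q * se), math.tanh(z + q * se)

# R values and event count as reported for the three major runs.
runs = {
    "all indexers": 0.5386,
    "librarian indexers": 0.5364,
    "engineer/scientist indexers": 0.4674,
}
for name, r in runs.items():
    low, high = fisher_ci(r, 6379)
    print(f"{name}: R={r:.4f} R^2={r * r:.4f} CI=({low:.4f}, {high:.4f})")
```

Because the interval width depends on 1/sqrt(n - 3), the same calculation gives much wider intervals for sub-samples of a few hundred events than for the full 6379 events.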
Perhaps the most dramatic result is that none of the
samples taken shows either a very strong or a very weak
correlation between the descriptors assigned and the
documents. At least for our sample, linear regression
accounts for about thirty percent of the indexing
assignments.



As expected, the librarians' indexing could be predicted
more accurately from the clues than could the indexing of the
engineers and scientists. The difference was significant at
the 99% level. The inexperience of the engineers and
scientists with indexing and with the thesaurus may have made
them much more dependent upon word-for-word matches between
the descriptors and the document than otherwise might have
been the case. Hence our results are probably conservative.
Differences between librarians and engineers or scientists
might be more pronounced under other experimental conditions.



It is difficult to compare the results of our multiple 
linear regression model with results obtained from previous 
studies because of a number of differences in the studies. 

First, there is presently not enough data on how the 
subject content of the sample documents affects the results. 
The documents used in our sample were in some instances
highly technical discussions of rather specific engineering
problems. The subject field of our documents compares most
closely with the studies done by Slamecka and Zunde (1963).
Unfortunately, they made only a cursory examination of the
data from the viewpoint of machine-like indexing.

Secondly, there is the problem of the size of the sample
of indexers. We used a total of 12 indexers, six librarians
and six engineers or scientists. All twelve indexed each of 
the sample documents. No previous study had such a large 
group of indexers. 

Thirdly, most previous studies did not account for the
non-assignment of index terms as discussed in Chapter 2. The
effect of looking only at assigned terms is demonstrated by
re-running the regression on only those experimental events
which have an indexer value above zero. Two runs were made.
In the first, a term had to be assigned by at least one 
indexer to be included in the regression; in the second, a 
term had to be assigned by at least two indexers to be 
included. Information about these two sub-samples is given 
below. 

At least one indexer assigned each term
591 experimental events
46 clue types with correlation greater than .0001
multiple correlation coefficient (R): 0.5486
square of the correlation coefficient (R²): 0.3009
95% confidence interval for R: .4896 to .6026

At least two indexers assigned each term
264 experimental events
41 clue types with correlation greater than .0001
multiple correlation coefficient (R): 0.5585
square of correlation coefficient (R²): 0.3120
95% confidence interval for R: .4694 to .6363

As expected, R² increases as more of the indexers agree
to assign a particular term, but the results are not striking
or significant. Because of the small number of experimental
events in which a majority of indexers agreed, and because of
the large number of independent variables, the confidence
intervals for these coefficients are considerably wider than
for the full sample size.



Fourth, although all of the above regressions tested the 
effect of sixty possible clue types on the indexers, they
still could account for only about a third of the variance in 
the indexing. This is in contrast to earlier studies which 
took only one or two clue types into account, but which did 
not consider the non-assignment of index terms. The effect 
of the small number of clue types would, in general, be to 
decrease the correlation between the clue types and the 
indexing. The inclusion of only assigned index terms would 
tend to have the opposite effect. This is probably why our 
numerical results are very roughly comparable to some studies 
done with fewer clues and based on assigned terms. 
Lastly, it is also possible that there is a theoretical
maximum to the amount of indexing which can be matched with
document words. For example, we can imagine a thesaurus 
which was specifically designed for a particular group of 
documents. This imaginary thesaurus might contain only words 
and phrases abstracted directly from the documents 
themselves. In this case, there is little opportunity for 
the indexer to assign a term not already in the document. We 
could also imagine a second thesaurus which made it a rule 
never to use a document word or phrase as a descriptor. 
Although such a thesaurus would probably be very difficult to 
compile, it would guarantee that there was no correlation 
between the index terms assigned and the words or phrases in 
the documents. 



The EJC Thesaurus obviously lies somewhere between these
two extremes. It is quite possible, therefore, that there
could only be a certain number of matches between the terms
and the document words simply because of the nature of the
documents and the thesaurus. The extent of this theoretical 
limitation on the amount of the potential match between the 
documents and the thesaurus might account for differences 
between results obtained by different experimenters. 
6.2.2 Results of the Combinatorial Model

The combinatorial model is based on Boolean combinations
of the sixty clue types. Details of the best Boolean
equation (produced according to the procedure described in
Section 4.3) are given below:

Best Boolean equation
Sample size: 2048
Figure-of-merit (non-trivial assignments): .1611
Standard deviation of figure-of-merit: .0353
Fraction of all predictions modelled (trivial and
non-trivial assignments): .6821
Case 2's: 125
Case 3's and 4's: 651

In terms of programming, the Boolean combinatorial model
was time consuming and difficult. Despite careful program
design and coding, it took over an hour of CPU time on an IBM
360/65 to OR 40,000 pairs of vectors, calculate a
figure-of-merit for each, and write the results on tape.
Similar run times were required for each stage of ANDing and
ORing. Because of these very long computer runs, the
combinatorial model is not exhaustive. Instead, as discussed
in Chapter 4, the best 300 vectors from the previous stage
were used to calculate vectors for the succeeding stage.

Another limitation of the combinatorial model was the
practical limitation on the recording of count information
for each type of clue in a document. Equation 2.02 is based
on a zero/not-zero decision. Thus there is no difference in
the binary record between a clue type which occurred just
once and one which occurred many times. Once again, this is
a practical decision necessitated by limited computer time.
The lack of clue count information, however, makes this model
less rich than the regression model.
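
The zero/not-zero record and the staged ORing with a cut-off at the best vectors can be sketched as follows. This is a minimal illustration, not the original program: the toy counts are hypothetical, and the `agreement` score is a simple stand-in for the figure-of-merit defined in Section 4.2, which is not reproduced in this chapter.

```python
from itertools import combinations

def binarize(counts):
    """Zero/non-zero record: 1 if the clue occurred at all, else 0."""
    return [1 if c > 0 else 0 for c in counts]

def v_or(a, b):
    """Element-wise OR of two binary clue vectors."""
    return [x | y for x, y in zip(a, b)]

def v_and(a, b):
    """Element-wise AND of two binary clue vectors."""
    return [x & y for x, y in zip(a, b)]

def agreement(predicted, assigned):
    """Stand-in score: share of events where prediction equals the indexing."""
    hits = sum(p == y for p, y in zip(predicted, assigned))
    return hits / len(assigned)

def next_stage(vectors, assigned, op, label, keep=300):
    """Combine every pair of vectors with `op` and keep the `keep` best."""
    scored = []
    for (name_a, a), (name_b, b) in combinations(vectors.items(), 2):
        combined = op(a, b)
        name = f"({name_a} {label} {name_b})"
        scored.append((agreement(combined, assigned), name, combined))
    scored.sort(key=lambda item: item[0], reverse=True)
    return {name: vec for _, name, vec in scored[:keep]}

# Hypothetical clue counts for six experimental events.
counts = {
    "MN 2A": [2, 0, 1, 0, 0, 3],
    "MN HA": [5, 1, 0, 0, 2, 4],
    "US M2T": [0, 0, 0, 1, 0, 1],
}
assigned = [1, 0, 1, 0, 0, 1]   # 1 = the indexers assigned the term

clues = {name: binarize(c) for name, c in counts.items()}
best = next_stage(clues, assigned, v_or, "OR", keep=2)
for name, vec in best.items():
    print(name, vec, agreement(vec, assigned))
```

Note that `binarize` discards exactly the count information that the regression model keeps, which is the loss of richness described above. Calling `next_stage` with `v_and` and the label "AND" gives the corresponding ANDing stage.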

A limitation on the sample size for the combinatorial
model was also made for computational reasons. However, the
particular sample taken was verified with the regression
model by running that model with both the limited and the
full data. The multiple correlation coefficient for the
smaller sample of 2048 was 0.5387, just 0.0001 larger than it
was for the sample of 6379. The limited sample of 2048,
therefore, is representative of the full sample size of 6379.

The figure-of-merit based on non-trivial assignments (see
Section 4.2 for definitions of "non-trivial" and
"figure-of-merit") is quite low. There were only a few case
2's in the best vector and a number of case 3's and 4's. As
discussed in Section 4.4, the standard deviation of the
figure-of-merit was calculated by making use of the Central
Limit Theorem. Sixty-four samples of thirty-two each were
chosen at random from the large sample of 2048. This
produced the standard deviation of the figure-of-merit of
.0353.

The goodness of the model can also be judged in terms of
the number of case 1's and 2's divided by the total number of
cases. This is the second figure-of-merit (called "fraction
of all predictions modelled") introduced in Equation 4.02 in
Section 4.2. This means that out of 2048 possible indexing
decisions, the combinatorial model duplicated 68% of the
indexers' decisions. This method of calculating this number
is more comparable with the regression model and will be
discussed in Section 6.2.3.
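
The two reported figures can be checked from the case counts. The sketch below rests on assumptions that are not stated in this chapter: that case 1 means neither the model nor the indexers assigned the term, that case 2 means both assigned it, that cases 3 and 4 are the two kinds of disagreement, and that the non-trivial figure-of-merit is case 2's divided by cases 2, 3 and 4 together. The last of these is inferred only because it reproduces the reported .1611; the authoritative definition is the one in Section 4.2.

```python
def figures_of_merit(case1, case2, case34):
    """Return (non-trivial figure-of-merit, fraction of predictions modelled).

    Assumed reading of the cases: case1 = joint non-assignment, case2 = joint
    assignment, case34 = disagreements between model and indexers.
    """
    total = case1 + case2 + case34
    non_trivial = case2 / (case2 + case34)      # inferred formula, see lead-in
    fraction_modelled = (case1 + case2) / total
    return non_trivial, fraction_modelled

sample_size = 2048
case2 = 125
case34 = 651
case1 = sample_size - case2 - case34            # remaining events

fom, fraction = figures_of_merit(case1, case2, case34)
print(f"non-trivial figure-of-merit: {fom:.4f}")
print(f"fraction of all predictions modelled: {fraction:.4f}")
```

The large gap between the two figures comes almost entirely from case 1: most of the 2048 decisions are joint non-assignments, which count toward the second figure but not the first.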



It is unfortunate that more computer time was not
available. It would have been interesting to repeat the
combinatorial model for the sub-samples used with the
regression model and to compare the results. As can be seen
from the discussion of the relative importance of clues in
Section 6.3, the combinatorial model has a more direct
interpretation of indexer behavior than does the regression
model. Perhaps further refinement of the programming and the
elimination of less valuable clue types may make it possible
to include count information and larger sample sizes in a
future version of the combinatorial model.



6.2.3 Comparison of the Results of the Models 

Primarily because the Boolean model did not make use of 
the clue count information in the documents, and because 
"best" was defined differently in the two models, there is no 
simple, direct comparison between the two models. To make 
the figures from the two models somewhat more comparable, a 
second figure-of-merit was calculated for the Boolean model.
This is the number recorded above and discussed in Section
4.2 as "fraction of all predictions modelled".



Both the combinatorial and the regression models were run
on the same sample of 2048. The combinatorial model
accounted for 68% of all indexer decisions. That is, of the
2048 decisions, there were 1397 decisions in which the model
correctly predicted what the indexers assigned. The model
assigned when the indexers did, and did not assign when they
didn't. For the same sample, the regression model had an R²
of 0.3009. In other words, approximately 30% of the variance
in indexing could be accounted for by the regression. In
view of the different ways in which these two percentages
were calculated, the amount of indexing accounted for by the
two models may be comparable. The lower percentage obtained
from the regression is probably due to the linearity assumed
by this model.

In Section 2.5.3 we discussed the assignment rules tested
in the Boolean and regression models and pointed out that in
some special cases the two models are equivalent. Each of
the four Boolean equations was tested for this equivalence
(that is, linear separability) with the Biswas (1971) method.
None can be realized with a single threshold element. Hence
there is no direct mathematical equivalence between the two
models.
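
Why an equation built from an OR of ANDed pairs cannot be realized by one threshold element can be illustrated with a small check. This is not the Biswas (1971) method; it is a simpler, swapped-in test based on a standard argument: if two inputs where the function is true and two inputs where it is false have the same component-wise sum, no set of weights and threshold can separate them. The function names and the four-variable example are illustrative assumptions, with the clue types treated as independent variables.

```python
from itertools import combinations_with_replacement, product

def non_separability_witness(f, n):
    """Look for true points (a, b) and false points (c, d) with a+b == c+d.

    Finding such a witness proves f is not linearly separable. Returning
    None is inconclusive in general; it only means this test found nothing.
    """
    points = list(product((0, 1), repeat=n))
    true_pts = [p for p in points if f(*p)]
    false_pts = [p for p in points if not f(*p)]
    sums = {}
    for a, b in combinations_with_replacement(true_pts, 2):
        sums[tuple(x + y for x, y in zip(a, b))] = (a, b)
    for c, d in combinations_with_replacement(false_pts, 2):
        key = tuple(x + y for x, y in zip(c, d))
        if key in sums:
            return sums[key], (c, d)
    return None

def two_phrases(a, b, c, d):
    """Same shape as (MN 2A AND ST 2A) OR (MN HA AND MN M2T)."""
    return (a and b) or (c and d)

def single_and(a, b):
    """A plain AND, which a single threshold element can realize."""
    return a and b

print(non_separability_witness(two_phrases, 4))
print(non_separability_witness(single_and, 2))
```

For `two_phrases` the check finds a witness, so no weighted sum of the four clues with a single cut-off reproduces that rule; for `single_and` it finds none.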



Neither of the models performed well enough to be useful
as a substitute for human indexing. A discussion of
prediction with these models has, therefore, been omitted.
However, the values of A and of the B's for the first five
steps of the regression are tabulated in Figure 6.05. Notice
that as each new variable is added the previous values of the
constant and of the B's change. The regression is adjusted
at each stage for the best fit, changing the coefficients for
the variables at each stage. As an example, let us take the
fifth step in the regression. All the variables are
positively related to Y. The higher the number of
occurrences of each of these five clue types, the more likely
the indexer to assign the descriptor. On the average, the
number of indexers assigning a term increases by one unit for
each three additional occurrences of a two-word main term in
the abstract, by two units for each additional occurrence of
a stemmed header in the title, and so forth.

The constant and coefficients for the full regression 
equation are tabulated in the right-hand column of Figure 
6.02. Since the regression equation accounts for such a 
small percentage of indexer performance, this tabulation is 
not of much practical value. 

In summary, then, at least for this sample and this
rather large group of indexers, we cannot model very much of
technical indexing with either a regression model or a
Boolean combinatorial model. Until we know more about
differences between technical fields, the effect of the
thesaurus on the indexing, etc., it is invalid to argue that
indexers in general act in a mechanical manner.
6.3 Relative Importance of the Clue Types

We have some specific evidence about the relative
importance of the clue types from each of the models. In
addition, we can compare the clue types important in the
engineer/scientist regression with the clues important in the
librarian regression. (See Section 2.5.1 for a definition of
each clue type and Figure 6.01 for a table of all clue types
and their acronyms.)



We can make no statements about the value of some clues 
in predicting indexing assignments because these clues did 
not occur in the sample. There were no title occurrences of 
any three-word descriptors, or of use, broader, narrower or 
related three-word terms in the abstract. Nor were there any 
two-word title occurrences of use or broader terms in the 
document sample. (See Figure 5.08 for a summary of clue
occurrences in each of the documents.) Note that these would
be the document-thesaurus matches least likely to occur in
any sample because the match criterion was the most stringent
(two and three word matches in the title and three word
matches in the abstract). Note also that a high frequency of
a clue type does not mean that the clue is necessarily
important in predicting indexer behavior. There do, however,
have to be enough occurrences of a clue type to make that
clue of practical value in the prediction.

Figures 6.02, 6.03, and 6.04 list the clues which have a
correlation greater than .0001 with the dependent variable
for each of the three major samples: all indexers, librarian
indexers, and engineer/scientist indexers. These figures
also show the R² at each step and the increase in R² caused
by the addition of each variable to the regression. Although
the order of importance varied from sample to sample, the
same clue types tended to be at the top of the list.



In all runs, a match of a two-word descriptor in the
abstract was the most important of the clues. This single
clue accounted for 63 to 75% of the final value of R. Other
clues consistently occurred in the top group of all three
regression runs. They were modifier2 of use references in
the abstract, two-word use references in the abstract,
modifier1 of broader terms in the title and modifier2 of the
stemmed term in the abstract. Although main and stem
two-word terms in the title, main three-word terms in the
abstract and modifier1 use references in the title were also
in the top group, they are less important because they
occurred infrequently in the sample. Thus main entries, use
references, modifiers1, modifiers2, and two and three word
phrases are the most important clues in predicting indexer
assignment.



There are also several clues which rank high in the 
regression for all indexers, but which have quite different 
rankings when the engineer/scientist and librarian
regressions are compared, although no test of statistical
significance was made. These clues are stemmed header terms
in the title, the main term header in the abstract and the
use term header in the abstract. The stemmed header in the
title is rated low
by the librarians and high by the scientists; the main term 
and use term headers in the abstract are rated high by the 
scientists and low and mid-range respectively by the 
librarians. Apparently the header word of a descriptor is 
treated differently by the librarians and scientists. Note 
that there are no header clues in the top group agreed upon 
by all indexers as important. 



Let us contrast the clue ranking of the regression model
with that given by the Boolean model. The best four Boolean
equations are given below.

(MN 2A AND ST 2A) OR (MN HA AND MN M2T) OR
(US M2T AND MN HA)

(MN 2A AND ST 2A) OR (MN HA AND MN M2T) OR
(US M2T AND ST HA)

(MN 2A AND ST 2A) OR (MN HA AND MN M2T) OR
(US M2T AND MN HA AND US M2A)

(MN 2A AND ST 2A) OR (MN HA AND MN M2T) OR
(US M2T AND ST HA AND US M2A)

Where:

MN 2A   is a two-word main term in the abstract
ST 2A   is a two-word stem in the abstract
MN HA   is a main term header in the abstract
MN M2T  is a main term modifier2 in the title
US M2T  is a use reference modifier2 in the title
US M2A  is a use reference modifier2 in the abstract
Each of the top equations contains the same two ANDed
terms plus a third term which is variable. The clue types
mentioned in all of the equations are main entries, stems or
use references. Narrower, broader and related terms do not
serve as good clues in the Boolean equations.

The first of the ANDed expressions is a very simple
requirement. If the descriptor has two words, then it must
appear, as a phrase, in the abstract of the document if the
descriptor is to be assigned. (Recall that there were no
three-word abstract or two-word title occurrences (see Figure
5.08) so that these clues did not occur in high enough
numbers to be represented in the final equation.) Of course,
the stem of any term occurs whenever the term itself occurs
by the clue definition rules in Section 2.5.1. The result is
consistent with the results of the regression model where
two-word main terms in the abstract account for a large part
of the final value of the regression coefficient.



The second ANDed expression represents a second way to
recognize a two-word phrase. The header for the descriptor
and the modifier2 for that descriptor must be present in the
document. Since most descriptors are two-word phrases, this
is simply another way of saying that the words of the phrase
must be present in the document.



The third ANDed expression is variable, but always
contains US M2T and either MN HA or ST HA. In two of the
equations, US M2A is an additional clue. Again MN HA and ST
HA are almost equivalent, so this last ANDed expression
becomes: (MN HA AND US M2T (and sometimes US M2A)). An
inspection of the use references and the main terms when
these clues occur shows that in many cases the modifier2 for
the main term was the same as the modifier2 for the use-for
reference. For example: 'optical instruments' use 'optical
measurements'. Hence this ANDed expression once again
reduces to: find the two-word descriptor phrase in the
[remainder of sentence missing in source]


In summary, two-word phrases account for the largest
amount of indexing behavior. Some potentially valuable clue
lengths, such as three-word terms, do not occur at all or in
large enough numbers to make possible a decision about their
value. Main, use and stemmed terms are the most important
thesaurus relations. In general, broader, narrower and
related terms from the thesaurus are not very useful in
accounting for indexing behavior. Header terms are rated
differently by the two sub-samples, but are not important for
the entire sample. Finally, no generalizations can be made
about the relative importance of title and abstract clues in
accounting for indexer performance.
6.4 Individual Documents and Indexers

The regression coefficients obtained for five randomly
selected documents in the sample were most interesting. The
pertinent information is summarized below.

Document Number             1        2        6       14       20

experimental events:      622      622      622      622      622
signif. clue types:        37       29       36       25       40
correlation coeff.:     .6412    .8574    .7228    .7868    .9084
R²:                     .4111    .7351    .5224    .6190    .8252
lower 99% conf. int.:   .5760    .8274    .6695    .7440    .8885
upper 99% conf. int.:   .6983    .8825    .7687    .8232    .9249
most important clues:   MN HA    MN 2A    RL 2A    US 2A    MN 2T
                        US 2A    US M1A   RL M1T   MN HA    US 2A
                        MN M2A   MN M2T   MN M1T   MN HT    [?] M1T
                        MN 2A    BR M1T   MN 2A    BR HA    ST HA

The significance of a clue type depends upon its
contribution to the total regression. A clue type was
considered significant if it had a correlation of at least
.0001 with the dependent variable.

Note that the most important clue types and the
correlation coefficients vary widely from document to
document. In general, the correlation coefficient is
considerably higher for an individual document than it is for
the sample as a whole. This means that the regression
coefficient for all the documents is very much a compromise.
The compromise lowers the overall coefficient because clues
which work well on some documents don't work well on others.
As we noted in Section 3.5, this fact decreases the
predictive value of this model.

Separate regression runs were made on the indexing of
four of the subjects. Some details of these runs are
summarized below.

Indexer Number:          [1?]        6        7    [11?]

experimental events:     6379     6379     6379     6379
signif. clue types:        43       48       47       48
correlation coeff.:     .3487    .4646    .3120    .2799
R²:                     .1216    .2158    .0974    .0783
lower 99% conf. int.:   .3200    .4389    .2825    .249[?]
upper 99% conf. int.:   .3768    .4896    .3409    .3094
For each indexer MN 2A was the top clue, accounting for
70, 78, 64 and 59% respectively of the correlation
coefficient for the four indexers. After this clue, however,
there were substantial differences among the top group of
clues in the regression. The uniform use of MN 2A as the
most important clue probably accounts for the top ranking of
that clue in the over-all regression runs.

The fact that each indexer exhibited a low correlation
coefficient as an individual, while single documents had high
correlation coefficients, indicates that there tends to be a
common reaction to a single document, but that averaging
across documents tends to decrease the correlation
coefficient because the average is a compromise in clue
styles among the documents. Individual indexers tend to be
less predictable than an indexer group because one person's
idiosyncrasies are not averaged with another's
idiosyncrasies.
6.5 Some Suggestions for Further Research

This dissertation concentrated on textual clues to the
exclusion of other types of clues (such as syntactic).
Further investigation of other types of clues might help
explain the existence of distinctive clue styles in
individual documents. When these styles can be recognized
from information about the document itself, we will have a
better understanding of how an indexer goes about indexing.

Although the Boolean model is of much interest, a
shortage of computer time prevented its full development.
Further research might uncover practical improvements to
speed up or to simplify the ANDing and ORing programs so that
a more extensive development of this model could be made.

Our research was limited to an exploration of twenty 
documents in the rather narrow subject field of 
instrumentation. Since variations in indexing style are to 
be expected across subject fields, it would be interesting to 
build similar models in other subject fields and to compare 
the results. 

Neither the regression nor the Boolean combinatorial
models could be considered very accurate models of human
indexing. However, as can be seen from Figure 5.07, humans
themselves don't agree as to which index terms should be
assigned. Inaccurate though these models are, it would be
interesting to use them predictively and to ask humans how
they rated the indexing derived from this mechanical source.
Perhaps these models produce indexing no worse than a
human's.



There is an implied "theory of the indexer" in this study 
which assumes that the indexer can be modelled by some 
combination of textual clues. The object of the
investigation was to find out which clues were most important
and how much of the indexing they accounted for. This is a
very elementary theory of how indexing proceeds. A future
study could begin to lay down a much more sophisticated
theory of the indexer with some of the evidence available
from this dissertation. For instance, two-word terms seem to
be the most dependable for purposes of prediction. Suppose
we start with a model to predict just two-word terms. We
might say that if the term under consideration is a two-word
term, then if that term, or if a stemmed version of that
term, is in the document, then the term is assigned.
Further elaboration of this simple flowchart model could be
tested against the actual index terms assigned until some
reasonable fit occurred. We could then test this flowchart
model against other indexers to learn how accurate and
complete it is.
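
The first step of such a flowchart model might be sketched as below. Everything here is a hypothetical illustration rather than a tested model: `crude_stem` is a rough stand-in for the stemming procedure described in the Appendix, and the example descriptor and document text are invented.

```python
import re

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common endings."""
    for ending in ("ations", "ation", "ments", "ment", "ing", "es", "s"):
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word

def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def predict_assignment(descriptor, document_text):
    """Two-word rule: assign if the phrase, or its stemmed form, is present.

    Returns None for descriptors that are not two words long, because the
    rule as stated says nothing about them.
    """
    words = re.findall(r"[a-z]+", descriptor.lower())
    if len(words) != 2:
        return None
    tokens = re.findall(r"[a-z]+", document_text.lower())
    if contains_phrase(tokens, words):
        return True
    stemmed_tokens = [crude_stem(t) for t in tokens]
    stemmed_words = [crude_stem(w) for w in words]
    return contains_phrase(stemmed_tokens, stemmed_words)

text = "Calibration of optical instruments for field use."
print(predict_assignment("optical instrument", text))
print(predict_assignment("pressure gauges", text))
print(predict_assignment("thermometers", text))
```

Each further branch of the flowchart (header words, use references, and so on) would be another test added after the two-word check, and each could be scored against the actual assignments before being kept.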
Figure 6.01 Clue Types Used in the Two Models and the
Acronyms Used for Them

acronym    description

MN 3T      three-word main descriptor entry in title
MN 3A      three-word main entry in abstract
MN 2T      two-word main entry in title
MN 2A      two-word main entry in abstract
MN HT      header word in title
MN HA      header word in abstract
MN M2T     modifier2 word of main entry in title
MN M2A     modifier2 word of main entry in abstract
MN M1T     modifier1 word of main entry in title
MN M1A     modifier1 word of main entry in abstract

ST 3T      three-word stem descriptor in title
ST 3A      three-word stem descriptor in abstract
ST 2T      two-word stem in title
ST 2A      two-word stem in abstract
ST HT      header stem in title
ST HA      header stem in abstract
ST M2T     modifier2 word of stem in title
ST M2A     modifier2 word of stem in abstract
ST M1T     modifier1 word of stem in title
ST M1A     modifier1 word of stem in abstract

US 3T      three-word use reference in title
US 3A      three-word use reference in abstract
US 2T      two-word use reference in title
US 2A      two-word use reference in abstract
US HT      header of use reference in title
US HA      header of use reference in abstract
US M2T     modifier2 word of use reference in title
US M2A     modifier2 word of use reference in abstract
US M1T     modifier1 word of use reference in title
US M1A     modifier1 word of use reference in abstract

BR 3T      three-word broader term in title
BR 3A      three-word broader term in abstract
BR 2T      two-word broader term in title
BR 2A      two-word broader term in abstract
BR HT      header word of broader term in title
BR HA      header word of broader term in abstract
BR M2T     modifier2 word of broader term in title
BR M2A     modifier2 word of broader term in abstract
BR M1T     modifier1 word of broader term in title
BR M1A     modifier1 word of broader term in abstract

NR 3T      three-word narrower term in title
NR 3A      three-word narrower term in abstract
NR 2T      two-word narrower term in title
NR 2A      two-word narrower term in abstract
NR HT      header of narrower term in title
NR HA      header of narrower term in abstract
NR M2T     modifier2 of narrower term in title
NR M2A     modifier2 of narrower term in abstract
NR M1T     modifier1 of narrower term in title
NR M1A     modifier1 of narrower term in abstract

RL 3T
RL 3A
RL 2T
RL 2A
(descriptions and remaining RL entries missing in source)
Figure 6.04 Relative Importance of Clue Types for
Engineer and Scientist Indexers

Figure 6.05 The First Five Regression Equations
Appendix. Changes to Lovins's Stemming Procedures

This appendix summarizes only the changes and additions
to the stems, codes and rules proposed by Lovins (1968). The
original paper should be consulted for a complete description
of the procedures and tables. These changes were required by
the vocabulary of the documents and thesaurus used in this
thesis. The effect of each proposed change to Lovins's
procedures was tested on Brown's "Normal and Reverse English
Word List" (1963) to guarantee that the intended change was
not a parochial one.

The procedure used in Section 5.2.3 for stemming document
and thesaurus words is dependent upon a table of stems, a set
of condition codes and a group of recoding rules. A word to
be stemmed is compared with the table of stem endings. The
object is to obtain the longest possible match between the
end of a word and an ending in the table. With each ending
in the table is an associated "condition code". This code
specifies the conditions to be met for that particular stem.
An ending is rejected if the conditions for that ending are
not met. If the ending passes the condition code test, the
remaining stem is subjected to the recoding rules to
standardize spelling variations.
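
The three-part procedure (longest ending match, condition code test, recoding) can be sketched as follows. The tables here are tiny and illustrative only: the endings and the condition names `any`, `min3` and `not_after_e` are hypothetical and are not the published tables or code letters, and the two recoding entries are modelled on rules 24 and 31 as listed later in this appendix. In this sketch recoding is applied whether or not an ending was removed, which is an assumption.

```python
MIN_STEM = 2  # shortest stem allowed for any ending (assumed value)

# Hypothetical condition codes: each takes the candidate stem.
CONDITIONS = {
    "any": lambda stem: True,
    "min3": lambda stem: len(stem) >= 3,
    "not_after_e": lambda stem: not stem.endswith("e"),
}

# Hypothetical ending table: ending -> condition code name.
ENDINGS = {
    "ations": "min3",
    "ation": "min3",
    "ing": "min3",
    "ion": "min3",
    "ed": "not_after_e",
    "s": "any",
}

# (old ending, new ending, letter after which the rule is blocked)
RECODING = [
    ("end", "ens", "s"),   # change 'end' to 'ens' except after 's'
    ("ert", "ers", "v"),   # change 'ert' to 'ers' except after 'v'
]

def recode(stem):
    """Apply the first recoding rule whose pattern and exception allow it."""
    for old, new, blocked_after in RECODING:
        if stem.endswith(old):
            head = stem[: -len(old)]
            if not head.endswith(blocked_after):
                return head + new
    return stem

def stem_word(word):
    """Remove the longest acceptable ending, then recode the stem."""
    word = word.lower()
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if not word.endswith(ending):
            continue
        candidate = word[: -len(ending)]
        condition = CONDITIONS[ENDINGS[ending]]
        if len(candidate) >= MIN_STEM and condition(candidate):
            word = candidate
            break   # longest acceptable match wins
    return recode(word)

for w in ("extending", "extension", "converted", "sending"):
    print(w, "->", stem_word(w))
```

The recoding step is what lets "extending" and "extension" fall together on one stem, while the exception clauses keep words such as "sending" and "converted" from being altered.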

The following three tables give the changes and additions
to the endings, condition codes and recoding rules used by
Lovins.
Additions and Changes to Endings and Condition Codes

Additions and Changes to the Condition Codes

E    not after 'e' unless 'gr' precedes 'e'
H    only after 't', 'll' or 'r'
L    do not remove after 'u', 'x', 's' unless 's' follows
     'o' and minimum stem length is 3
R    minimum stem length is 3 and remove only after
     'u' or 'r'
V    remove only after 'c' or 'r'
DD   remove only after 'd', 'z', 't', 'r', 'h', 'p', 'w',
     'g', 'l' (list partly illegible in source),
     except after 'met'
EE   do not remove after 'u' or 'e'

Additions and Changes to Recoding Rules

5a   change 'ript' to 'rib'
15   change 'ex' to 'ec' except after 'l'
24   change 'end' to 'ens' except after 's'
31   change 'ert' to 'ers' except after 'v'
     change 'mart' to 'mar'
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